
[bpf-next,v8,05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy

Message ID 20180227004121.3633-6-mic@digikod.net (mailing list archive)
State New, archived

Commit Message

Mickaël Salaün Feb. 27, 2018, 12:41 a.m. UTC
The seccomp(2) syscall can be used by a task to apply a Landlock program
to itself. As with a seccomp filter, a Landlock program is enforced for
the current task and all its future children. A program is immutable and
a task can only add new restricting programs to itself, forming a list of
programs.

A Landlock program is tied to a Landlock hook. If the action on a kernel
object is allowed by the other Linux security mechanisms (e.g. DAC,
capabilities, other LSMs), then a Landlock hook related to this kind of
object is triggered. The list of programs for this hook is then
evaluated. Each program returns a 32-bit value which can deny the action
on a kernel object with a non-zero value. If every program in the list
returns zero, then the action on the object is allowed.

Multiple Landlock programs can be chained to share a 64-bit value for a
call chain (e.g. evaluating multiple elements of a file path).  This
chaining is restricted when a process constructs this chain by loading a
program, but additional checks are performed when it requests to apply
this chain of programs to itself.  The restrictions ensure that it is
not possible to call multiple programs in a way that would imply handling
multiple shared values (i.e. cookies) for one chain.  For now, only a
fs_pick program can be chained to the same type of program, because it
may make sense if they have different triggers (cf. next commits).  This
restriction still allows Landlock programs to be reused in a safe way
(e.g. using the same loaded fs_walk program with multiple chains of
fs_pick programs).
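
As an illustration of the intended userspace flow, here is a minimal
sketch of enforcing an already-loaded Landlock program on the current
task via seccomp(2). The command constant and the argument layout below
are assumptions for illustration only; the real definitions come from
include/uapi/linux/seccomp.h in this patch and may differ.

/*
 * Hypothetical userspace sketch: enforce a loaded Landlock program on
 * the current task via seccomp(2).  SECCOMP_PREPEND_LANDLOCK_PROG and
 * its value are assumed names for this example.
 */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SECCOMP_PREPEND_LANDLOCK_PROG
#define SECCOMP_PREPEND_LANDLOCK_PROG 2	/* assumed value */
#endif

static int landlock_enforce(int prog_fd)
{
	/* prog_fd would be the fd returned by bpf(BPF_PROG_LOAD) for a
	 * BPF_PROG_TYPE_LANDLOCK_HOOK program. */
	return syscall(__NR_seccomp, SECCOMP_PREPEND_LANDLOCK_PROG, 0,
		       &prog_fd);
}

int main(void)
{
	int prog_fd = -1;	/* placeholder: load a real program first */

	if (landlock_enforce(prog_fd))
		perror("seccomp");
	return 0;
}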

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Will Drewry <wad@chromium.org>
Link: https://lkml.kernel.org/r/c10a503d-5e35-7785-2f3d-25ed8dd63fab@digikod.net
---

Changes since v7:
* handle and verify program chains
* split and rename providers.c to enforce.c and enforce_seccomp.c
* rename LANDLOCK_SUBTYPE_* to LANDLOCK_*

Changes since v6:
* rename some functions with more accurate names to reflect that an eBPF
  program for Landlock could be used for something other than a rule
* reword rule "appending" to "prepending" and explain it
* remove the superfluous no_new_privs check, only check global
  CAP_SYS_ADMIN when prepending a Landlock rule (needed for containers)
* create and use {get,put}_seccomp_landlock() (suggested by Kees Cook)
* replace ifdef with static inlined function (suggested by Kees Cook)
* use get_user() (suggested by Kees Cook)
* replace atomic_t with refcount_t (requested by Kees Cook)
* move struct landlock_{rule,events} from landlock.h to common.h
* cleanup headers

Changes since v5:
* remove struct landlock_node and use a similar inheritance mechanism
  as seccomp-bpf (requested by Andy Lutomirski)
* rename SECCOMP_ADD_LANDLOCK_RULE to SECCOMP_APPEND_LANDLOCK_RULE
* rename file manager.c to providers.c
* add comments
* typo and cosmetic fixes

Changes since v4:
* merge manager and seccomp patches
* return -EFAULT in seccomp(2) when user_bpf_fd is null to easily check
  if Landlock is supported
* only allow a process with the global CAP_SYS_ADMIN to use Landlock
  (will be lifted in the future)
* add an early check to exit as soon as possible if the current process
  does not have Landlock rules

Changes since v3:
* remove the hard link with seccomp (suggested by Andy Lutomirski and
  Kees Cook):
  * remove the cookie which could imply multiple evaluation of Landlock
    rules
  * remove the origin field in struct landlock_data
* remove documentation fix (merged upstream)
* rename the new seccomp command to SECCOMP_ADD_LANDLOCK_RULE
* internal renaming
* split commit
* new design to be able to inherit on the fly the parent rules

Changes since v2:
* Landlock programs can now be run without seccomp filter but for any
  syscall (from the process) or interruption
* move Landlock related functions and structs into security/landlock/*
  (to manage cgroups as well)
* fix seccomp filter handling: run Landlock programs for each of their
  legitimate seccomp filter
* properly clean up all seccomp results
* cosmetic changes to ease the understanding
* fix some ifdef
---
 include/linux/landlock.h            |  37 ++++
 include/linux/seccomp.h             |   5 +
 include/uapi/linux/seccomp.h        |   1 +
 kernel/fork.c                       |   8 +-
 kernel/seccomp.c                    |   4 +
 security/landlock/Makefile          |   3 +-
 security/landlock/chain.c           |  39 ++++
 security/landlock/chain.h           |  35 ++++
 security/landlock/common.h          |  53 +++++
 security/landlock/enforce.c         | 386 ++++++++++++++++++++++++++++++++++++
 security/landlock/enforce.h         |  21 ++
 security/landlock/enforce_seccomp.c | 102 ++++++++++
 12 files changed, 692 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/landlock.h
 create mode 100644 security/landlock/chain.c
 create mode 100644 security/landlock/chain.h
 create mode 100644 security/landlock/enforce.c
 create mode 100644 security/landlock/enforce.h
 create mode 100644 security/landlock/enforce_seccomp.c

Comments

Alexei Starovoitov Feb. 27, 2018, 2:08 a.m. UTC | #1
On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
> The seccomp(2) syscall can be used by a task to apply a Landlock program
> to itself. As a seccomp filter, a Landlock program is enforced for the
> current task and all its future children. A program is immutable and a
> task can only add new restricting programs to itself, forming a list of
> programss.
> 
> A Landlock program is tied to a Landlock hook. If the action on a kernel
> object is allowed by the other Linux security mechanisms (e.g. DAC,
> capabilities, other LSM), then a Landlock hook related to this kind of
> object is triggered. The list of programs for this hook is then
> evaluated. Each program return a 32-bit value which can deny the action
> on a kernel object with a non-zero value. If every programs of the list
> return zero, then the action on the object is allowed.
> 
> Multiple Landlock programs can be chained to share a 64-bits value for a
> call chain (e.g. evaluating multiple elements of a file path).  This
> chaining is restricted when a process construct this chain by loading a
> program, but additional checks are performed when it requests to apply
> this chain of programs to itself.  The restrictions ensure that it is
> not possible to call multiple programs in a way that would imply to
> handle multiple shared values (i.e. cookies) for one chain.  For now,
> only a fs_pick program can be chained to the same type of program,
> because it may make sense if they have different triggers (cf. next
> commits).  This restrictions still allows to reuse Landlock programs in
> a safe way (e.g. use the same loaded fs_walk program with multiple
> chains of fs_pick programs).
> 
> Signed-off-by: Mickaël Salaün <mic@digikod.net>

...

> +struct landlock_prog_set *landlock_prepend_prog(
> +		struct landlock_prog_set *current_prog_set,
> +		struct bpf_prog *prog)
> +{
> +	struct landlock_prog_set *new_prog_set = current_prog_set;
> +	unsigned long pages;
> +	int err;
> +	size_t i;
> +	struct landlock_prog_set tmp_prog_set = {};
> +
> +	if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
> +		return ERR_PTR(-EINVAL);
> +
> +	/* validate memory size allocation */
> +	pages = prog->pages;
> +	if (current_prog_set) {
> +		size_t i;
> +
> +		for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
> +			struct landlock_prog_list *walker_p;
> +
> +			for (walker_p = current_prog_set->programs[i];
> +					walker_p; walker_p = walker_p->prev)
> +				pages += walker_p->prog->pages;
> +		}
> +		/* count a struct landlock_prog_set if we need to allocate one */
> +		if (refcount_read(&current_prog_set->usage) != 1)
> +			pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
> +				/ PAGE_SIZE;
> +	}
> +	if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
> +		return ERR_PTR(-E2BIG);
> +
> +	/* ensure early that we can allocate enough memory for the new
> +	 * prog_lists */
> +	err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
> +	if (err)
> +		return ERR_PTR(err);
> +
> +	/*
> +	 * Each task_struct points to an array of prog list pointers.  These
> +	 * tables are duplicated when additions are made (which means each
> +	 * table needs to be refcounted for the processes using it). When a new
> +	 * table is created, all the refcounters on the prog_list are bumped (to
> +	 * track each table that references the prog). When a new prog is
> +	 * added, it's just prepended to the list for the new table to point
> +	 * at.
> +	 *
> +	 * Manage all the possible errors before this step to not uselessly
> +	 * duplicate current_prog_set and avoid a rollback.
> +	 */
> +	if (!new_prog_set) {
> +		/*
> +		 * If there is no Landlock program set used by the current task,
> +		 * then create a new one.
> +		 */
> +		new_prog_set = new_landlock_prog_set();
> +		if (IS_ERR(new_prog_set))
> +			goto put_tmp_lists;
> +	} else if (refcount_read(&current_prog_set->usage) > 1) {
> +		/*
> +		 * If the current task is not the sole user of its Landlock
> +		 * program set, then duplicate them.
> +		 */
> +		new_prog_set = new_landlock_prog_set();
> +		if (IS_ERR(new_prog_set))
> +			goto put_tmp_lists;
> +		for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
> +			new_prog_set->programs[i] =
> +				READ_ONCE(current_prog_set->programs[i]);
> +			if (new_prog_set->programs[i])
> +				refcount_inc(&new_prog_set->programs[i]->usage);
> +		}
> +
> +		/*
> +		 * Landlock program set from the current task will not be freed
> +		 * here because the usage is strictly greater than 1. It is
> +		 * only prevented to be freed by another task thanks to the
> +		 * caller of landlock_prepend_prog() which should be locked if
> +		 * needed.
> +		 */
> +		landlock_put_prog_set(current_prog_set);
> +	}
> +
> +	/* prepend tmp_prog_set to new_prog_set */
> +	for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
> +		/* get the last new list */
> +		struct landlock_prog_list *last_list =
> +			tmp_prog_set.programs[i];
> +
> +		if (last_list) {
> +			while (last_list->prev)
> +				last_list = last_list->prev;
> +			/* no need to increment usage (pointer replacement) */
> +			last_list->prev = new_prog_set->programs[i];
> +			new_prog_set->programs[i] = tmp_prog_set.programs[i];
> +		}
> +	}
> +	new_prog_set->chain_last = tmp_prog_set.chain_last;
> +	return new_prog_set;
> +
> +put_tmp_lists:
> +	for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
> +		put_landlock_prog_list(tmp_prog_set.programs[i]);
> +	return new_prog_set;
> +}

Nack on the chaining concept.
Please do not reinvent the wheel.
There is an existing mechanism for attaching/detaching/querying multiple
programs attached to cgroup and tracing hooks, which are also
efficiently executed via BPF_PROG_RUN_ARRAY.
Please use that instead.
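
For context on the mechanism referenced here: cgroup-bpf keeps a flat,
refcounted array of attached programs per hook and runs them in
sequence, combining their verdicts. The following standalone sketch only
illustrates that run-array pattern; it is not the kernel's
BPF_PROG_RUN_ARRAY implementation, and all names in it are invented for
the example.

/* Standalone illustration of the "prog run array" pattern: a flat array
 * of programs evaluated in order, where any program can veto the action
 * (0 = allow, non-zero = deny, matching the commit message above). */
#include <stdbool.h>
#include <stddef.h>

struct prog {
	unsigned int (*run)(const void *ctx);	/* program verdict */
};

struct prog_array {
	size_t cnt;
	const struct prog *progs[];	/* programs attached to one hook */
};

/* Returns true only if every attached program allows the action. */
static bool run_prog_array(const struct prog_array *array, const void *ctx)
{
	size_t i;

	for (i = 0; array && i < array->cnt; i++)
		if (array->progs[i]->run(ctx))
			return false;
	return true;
}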
Andy Lutomirski Feb. 27, 2018, 4:40 a.m. UTC | #2
On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>> The seccomp(2) syscall can be used by a task to apply a Landlock program
>> to itself. As a seccomp filter, a Landlock program is enforced for the
>> current task and all its future children. A program is immutable and a
>> task can only add new restricting programs to itself, forming a list of
>> programss.
>>
>> A Landlock program is tied to a Landlock hook. If the action on a kernel
>> object is allowed by the other Linux security mechanisms (e.g. DAC,
>> capabilities, other LSM), then a Landlock hook related to this kind of
>> object is triggered. The list of programs for this hook is then
>> evaluated. Each program return a 32-bit value which can deny the action
>> on a kernel object with a non-zero value. If every programs of the list
>> return zero, then the action on the object is allowed.
>>
>> Multiple Landlock programs can be chained to share a 64-bits value for a
>> call chain (e.g. evaluating multiple elements of a file path).  This
>> chaining is restricted when a process construct this chain by loading a
>> program, but additional checks are performed when it requests to apply
>> this chain of programs to itself.  The restrictions ensure that it is
>> not possible to call multiple programs in a way that would imply to
>> handle multiple shared values (i.e. cookies) for one chain.  For now,
>> only a fs_pick program can be chained to the same type of program,
>> because it may make sense if they have different triggers (cf. next
>> commits).  This restrictions still allows to reuse Landlock programs in
>> a safe way (e.g. use the same loaded fs_walk program with multiple
>> chains of fs_pick programs).
>>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>
> ...
>
>> +struct landlock_prog_set *landlock_prepend_prog(
>> +             struct landlock_prog_set *current_prog_set,
>> +             struct bpf_prog *prog)
>> +{
>> +     struct landlock_prog_set *new_prog_set = current_prog_set;
>> +     unsigned long pages;
>> +     int err;
>> +     size_t i;
>> +     struct landlock_prog_set tmp_prog_set = {};
>> +
>> +     if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>> +             return ERR_PTR(-EINVAL);
>> +
>> +     /* validate memory size allocation */
>> +     pages = prog->pages;
>> +     if (current_prog_set) {
>> +             size_t i;
>> +
>> +             for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
>> +                     struct landlock_prog_list *walker_p;
>> +
>> +                     for (walker_p = current_prog_set->programs[i];
>> +                                     walker_p; walker_p = walker_p->prev)
>> +                             pages += walker_p->prog->pages;
>> +             }
>> +             /* count a struct landlock_prog_set if we need to allocate one */
>> +             if (refcount_read(&current_prog_set->usage) != 1)
>> +                     pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
>> +                             / PAGE_SIZE;
>> +     }
>> +     if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
>> +             return ERR_PTR(-E2BIG);
>> +
>> +     /* ensure early that we can allocate enough memory for the new
>> +      * prog_lists */
>> +     err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
>> +     if (err)
>> +             return ERR_PTR(err);
>> +
>> +     /*
>> +      * Each task_struct points to an array of prog list pointers.  These
>> +      * tables are duplicated when additions are made (which means each
>> +      * table needs to be refcounted for the processes using it). When a new
>> +      * table is created, all the refcounters on the prog_list are bumped (to
>> +      * track each table that references the prog). When a new prog is
>> +      * added, it's just prepended to the list for the new table to point
>> +      * at.
>> +      *
>> +      * Manage all the possible errors before this step to not uselessly
>> +      * duplicate current_prog_set and avoid a rollback.
>> +      */
>> +     if (!new_prog_set) {
>> +             /*
>> +              * If there is no Landlock program set used by the current task,
>> +              * then create a new one.
>> +              */
>> +             new_prog_set = new_landlock_prog_set();
>> +             if (IS_ERR(new_prog_set))
>> +                     goto put_tmp_lists;
>> +     } else if (refcount_read(&current_prog_set->usage) > 1) {
>> +             /*
>> +              * If the current task is not the sole user of its Landlock
>> +              * program set, then duplicate them.
>> +              */
>> +             new_prog_set = new_landlock_prog_set();
>> +             if (IS_ERR(new_prog_set))
>> +                     goto put_tmp_lists;
>> +             for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
>> +                     new_prog_set->programs[i] =
>> +                             READ_ONCE(current_prog_set->programs[i]);
>> +                     if (new_prog_set->programs[i])
>> +                             refcount_inc(&new_prog_set->programs[i]->usage);
>> +             }
>> +
>> +             /*
>> +              * Landlock program set from the current task will not be freed
>> +              * here because the usage is strictly greater than 1. It is
>> +              * only prevented to be freed by another task thanks to the
>> +              * caller of landlock_prepend_prog() which should be locked if
>> +              * needed.
>> +              */
>> +             landlock_put_prog_set(current_prog_set);
>> +     }
>> +
>> +     /* prepend tmp_prog_set to new_prog_set */
>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
>> +             /* get the last new list */
>> +             struct landlock_prog_list *last_list =
>> +                     tmp_prog_set.programs[i];
>> +
>> +             if (last_list) {
>> +                     while (last_list->prev)
>> +                             last_list = last_list->prev;
>> +                     /* no need to increment usage (pointer replacement) */
>> +                     last_list->prev = new_prog_set->programs[i];
>> +                     new_prog_set->programs[i] = tmp_prog_set.programs[i];
>> +             }
>> +     }
>> +     new_prog_set->chain_last = tmp_prog_set.chain_last;
>> +     return new_prog_set;
>> +
>> +put_tmp_lists:
>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
>> +             put_landlock_prog_list(tmp_prog_set.programs[i]);
>> +     return new_prog_set;
>> +}
>
> Nack on the chaining concept.
> Please do not reinvent the wheel.
> There is an existing mechanism for attaching/detaching/quering multiple
> programs attached to cgroup and tracing hooks that are also
> efficiently executed via BPF_PROG_RUN_ARRAY.
> Please use that instead.
>

I don't see how that would help.  Suppose you add a filter, then
fork(), and then the child adds another filter.  Do you want to
duplicate the entire array?  You certainly can't *modify* the array
because you'll affect processes that shouldn't be affected.

In contrast, doing this through seccomp like the earlier patches
seemed just fine to me, and seccomp already had the right logic.
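
The property Andy is describing is seccomp's per-task filter list: each
task points at its newest filter, fork() only takes a reference on that
head, and adding a filter prepends to the caller's own list without
touching any task that shares the older tail. A simplified standalone
sketch of that sharing (names invented for illustration, not the actual
kernel/seccomp.c code):

/* Simplified sketch of seccomp-style filter inheritance: each task points
 * at the newest filter; older filters are shared, refcounted list tails. */
#include <stdlib.h>

struct filter {
	int refcount;		/* the kernel uses refcount_t */
	struct filter *prev;	/* older filter, possibly shared */
	/* the attached bpf_prog would live here */
};

struct task {
	struct filter *filter;	/* newest filter for this task, or NULL */
};

/* fork(): the child just shares the parent's list head. */
static void inherit_filters(struct task *child, const struct task *parent)
{
	child->filter = parent->filter;
	if (child->filter)
		child->filter->refcount++;
}

/* Adding a filter only prepends to the caller's own list; tasks that
 * share the old head are unaffected. */
static int add_filter(struct task *tsk)
{
	struct filter *f = calloc(1, sizeof(*f));

	if (!f)
		return -1;
	f->refcount = 1;
	f->prev = tsk->filter;	/* share the existing tail */
	tsk->filter = f;
	return 0;
}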
Alexei Starovoitov Feb. 27, 2018, 4:54 a.m. UTC | #3
On Tue, Feb 27, 2018 at 04:40:34AM +0000, Andy Lutomirski wrote:
> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
> >> The seccomp(2) syscall can be used by a task to apply a Landlock program
> >> to itself. As a seccomp filter, a Landlock program is enforced for the
> >> current task and all its future children. A program is immutable and a
> >> task can only add new restricting programs to itself, forming a list of
> >> programss.
> >>
> >> A Landlock program is tied to a Landlock hook. If the action on a kernel
> >> object is allowed by the other Linux security mechanisms (e.g. DAC,
> >> capabilities, other LSM), then a Landlock hook related to this kind of
> >> object is triggered. The list of programs for this hook is then
> >> evaluated. Each program return a 32-bit value which can deny the action
> >> on a kernel object with a non-zero value. If every programs of the list
> >> return zero, then the action on the object is allowed.
> >>
> >> Multiple Landlock programs can be chained to share a 64-bits value for a
> >> call chain (e.g. evaluating multiple elements of a file path).  This
> >> chaining is restricted when a process construct this chain by loading a
> >> program, but additional checks are performed when it requests to apply
> >> this chain of programs to itself.  The restrictions ensure that it is
> >> not possible to call multiple programs in a way that would imply to
> >> handle multiple shared values (i.e. cookies) for one chain.  For now,
> >> only a fs_pick program can be chained to the same type of program,
> >> because it may make sense if they have different triggers (cf. next
> >> commits).  This restrictions still allows to reuse Landlock programs in
> >> a safe way (e.g. use the same loaded fs_walk program with multiple
> >> chains of fs_pick programs).
> >>
> >> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> >
> > ...
> >
> >> +struct landlock_prog_set *landlock_prepend_prog(
> >> +             struct landlock_prog_set *current_prog_set,
> >> +             struct bpf_prog *prog)
> >> +{
> >> +     struct landlock_prog_set *new_prog_set = current_prog_set;
> >> +     unsigned long pages;
> >> +     int err;
> >> +     size_t i;
> >> +     struct landlock_prog_set tmp_prog_set = {};
> >> +
> >> +     if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
> >> +             return ERR_PTR(-EINVAL);
> >> +
> >> +     /* validate memory size allocation */
> >> +     pages = prog->pages;
> >> +     if (current_prog_set) {
> >> +             size_t i;
> >> +
> >> +             for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
> >> +                     struct landlock_prog_list *walker_p;
> >> +
> >> +                     for (walker_p = current_prog_set->programs[i];
> >> +                                     walker_p; walker_p = walker_p->prev)
> >> +                             pages += walker_p->prog->pages;
> >> +             }
> >> +             /* count a struct landlock_prog_set if we need to allocate one */
> >> +             if (refcount_read(&current_prog_set->usage) != 1)
> >> +                     pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
> >> +                             / PAGE_SIZE;
> >> +     }
> >> +     if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
> >> +             return ERR_PTR(-E2BIG);
> >> +
> >> +     /* ensure early that we can allocate enough memory for the new
> >> +      * prog_lists */
> >> +     err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
> >> +     if (err)
> >> +             return ERR_PTR(err);
> >> +
> >> +     /*
> >> +      * Each task_struct points to an array of prog list pointers.  These
> >> +      * tables are duplicated when additions are made (which means each
> >> +      * table needs to be refcounted for the processes using it). When a new
> >> +      * table is created, all the refcounters on the prog_list are bumped (to
> >> +      * track each table that references the prog). When a new prog is
> >> +      * added, it's just prepended to the list for the new table to point
> >> +      * at.
> >> +      *
> >> +      * Manage all the possible errors before this step to not uselessly
> >> +      * duplicate current_prog_set and avoid a rollback.
> >> +      */
> >> +     if (!new_prog_set) {
> >> +             /*
> >> +              * If there is no Landlock program set used by the current task,
> >> +              * then create a new one.
> >> +              */
> >> +             new_prog_set = new_landlock_prog_set();
> >> +             if (IS_ERR(new_prog_set))
> >> +                     goto put_tmp_lists;
> >> +     } else if (refcount_read(&current_prog_set->usage) > 1) {
> >> +             /*
> >> +              * If the current task is not the sole user of its Landlock
> >> +              * program set, then duplicate them.
> >> +              */
> >> +             new_prog_set = new_landlock_prog_set();
> >> +             if (IS_ERR(new_prog_set))
> >> +                     goto put_tmp_lists;
> >> +             for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
> >> +                     new_prog_set->programs[i] =
> >> +                             READ_ONCE(current_prog_set->programs[i]);
> >> +                     if (new_prog_set->programs[i])
> >> +                             refcount_inc(&new_prog_set->programs[i]->usage);
> >> +             }
> >> +
> >> +             /*
> >> +              * Landlock program set from the current task will not be freed
> >> +              * here because the usage is strictly greater than 1. It is
> >> +              * only prevented to be freed by another task thanks to the
> >> +              * caller of landlock_prepend_prog() which should be locked if
> >> +              * needed.
> >> +              */
> >> +             landlock_put_prog_set(current_prog_set);
> >> +     }
> >> +
> >> +     /* prepend tmp_prog_set to new_prog_set */
> >> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
> >> +             /* get the last new list */
> >> +             struct landlock_prog_list *last_list =
> >> +                     tmp_prog_set.programs[i];
> >> +
> >> +             if (last_list) {
> >> +                     while (last_list->prev)
> >> +                             last_list = last_list->prev;
> >> +                     /* no need to increment usage (pointer replacement) */
> >> +                     last_list->prev = new_prog_set->programs[i];
> >> +                     new_prog_set->programs[i] = tmp_prog_set.programs[i];
> >> +             }
> >> +     }
> >> +     new_prog_set->chain_last = tmp_prog_set.chain_last;
> >> +     return new_prog_set;
> >> +
> >> +put_tmp_lists:
> >> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
> >> +             put_landlock_prog_list(tmp_prog_set.programs[i]);
> >> +     return new_prog_set;
> >> +}
> >
> > Nack on the chaining concept.
> > Please do not reinvent the wheel.
> > There is an existing mechanism for attaching/detaching/quering multiple
> > programs attached to cgroup and tracing hooks that are also
> > efficiently executed via BPF_PROG_RUN_ARRAY.
> > Please use that instead.
> >
> 
> I don't see how that would help.  Suppose you add a filter, then
> fork(), and then the child adds another filter.  Do you want to
> duplicate the entire array?  You certainly can't *modify* the array
> because you'll affect processes that shouldn't be affected.
> 
> In contrast, doing this through seccomp like the earlier patches
> seemed just fine to me, and seccomp already had the right logic.

It doesn't look to me like the existing seccomp side of managing the fork
situation can be reused. Here there is an attempt to add a 'chaining'
concept which is sort of an extension of the existing seccomp style,
but it is largely done on the bpf side and contradicts cgroup/tracing.
Andy Lutomirski Feb. 27, 2018, 5:20 a.m. UTC | #4
On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Feb 27, 2018 at 04:40:34AM +0000, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>> >> The seccomp(2) syscall can be used by a task to apply a Landlock program
>> >> to itself. As a seccomp filter, a Landlock program is enforced for the
>> >> current task and all its future children. A program is immutable and a
>> >> task can only add new restricting programs to itself, forming a list of
>> >> programss.
>> >>
>> >> A Landlock program is tied to a Landlock hook. If the action on a kernel
>> >> object is allowed by the other Linux security mechanisms (e.g. DAC,
>> >> capabilities, other LSM), then a Landlock hook related to this kind of
>> >> object is triggered. The list of programs for this hook is then
>> >> evaluated. Each program return a 32-bit value which can deny the action
>> >> on a kernel object with a non-zero value. If every programs of the list
>> >> return zero, then the action on the object is allowed.
>> >>
>> >> Multiple Landlock programs can be chained to share a 64-bits value for a
>> >> call chain (e.g. evaluating multiple elements of a file path).  This
>> >> chaining is restricted when a process construct this chain by loading a
>> >> program, but additional checks are performed when it requests to apply
>> >> this chain of programs to itself.  The restrictions ensure that it is
>> >> not possible to call multiple programs in a way that would imply to
>> >> handle multiple shared values (i.e. cookies) for one chain.  For now,
>> >> only a fs_pick program can be chained to the same type of program,
>> >> because it may make sense if they have different triggers (cf. next
>> >> commits).  This restrictions still allows to reuse Landlock programs in
>> >> a safe way (e.g. use the same loaded fs_walk program with multiple
>> >> chains of fs_pick programs).
>> >>
>> >> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>> >
>> > ...
>> >
>> >> +struct landlock_prog_set *landlock_prepend_prog(
>> >> +             struct landlock_prog_set *current_prog_set,
>> >> +             struct bpf_prog *prog)
>> >> +{
>> >> +     struct landlock_prog_set *new_prog_set = current_prog_set;
>> >> +     unsigned long pages;
>> >> +     int err;
>> >> +     size_t i;
>> >> +     struct landlock_prog_set tmp_prog_set = {};
>> >> +
>> >> +     if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>> >> +             return ERR_PTR(-EINVAL);
>> >> +
>> >> +     /* validate memory size allocation */
>> >> +     pages = prog->pages;
>> >> +     if (current_prog_set) {
>> >> +             size_t i;
>> >> +
>> >> +             for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
>> >> +                     struct landlock_prog_list *walker_p;
>> >> +
>> >> +                     for (walker_p = current_prog_set->programs[i];
>> >> +                                     walker_p; walker_p = walker_p->prev)
>> >> +                             pages += walker_p->prog->pages;
>> >> +             }
>> >> +             /* count a struct landlock_prog_set if we need to allocate one */
>> >> +             if (refcount_read(&current_prog_set->usage) != 1)
>> >> +                     pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
>> >> +                             / PAGE_SIZE;
>> >> +     }
>> >> +     if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
>> >> +             return ERR_PTR(-E2BIG);
>> >> +
>> >> +     /* ensure early that we can allocate enough memory for the new
>> >> +      * prog_lists */
>> >> +     err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
>> >> +     if (err)
>> >> +             return ERR_PTR(err);
>> >> +
>> >> +     /*
>> >> +      * Each task_struct points to an array of prog list pointers.  These
>> >> +      * tables are duplicated when additions are made (which means each
>> >> +      * table needs to be refcounted for the processes using it). When a new
>> >> +      * table is created, all the refcounters on the prog_list are bumped (to
>> >> +      * track each table that references the prog). When a new prog is
>> >> +      * added, it's just prepended to the list for the new table to point
>> >> +      * at.
>> >> +      *
>> >> +      * Manage all the possible errors before this step to not uselessly
>> >> +      * duplicate current_prog_set and avoid a rollback.
>> >> +      */
>> >> +     if (!new_prog_set) {
>> >> +             /*
>> >> +              * If there is no Landlock program set used by the current task,
>> >> +              * then create a new one.
>> >> +              */
>> >> +             new_prog_set = new_landlock_prog_set();
>> >> +             if (IS_ERR(new_prog_set))
>> >> +                     goto put_tmp_lists;
>> >> +     } else if (refcount_read(&current_prog_set->usage) > 1) {
>> >> +             /*
>> >> +              * If the current task is not the sole user of its Landlock
>> >> +              * program set, then duplicate them.
>> >> +              */
>> >> +             new_prog_set = new_landlock_prog_set();
>> >> +             if (IS_ERR(new_prog_set))
>> >> +                     goto put_tmp_lists;
>> >> +             for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
>> >> +                     new_prog_set->programs[i] =
>> >> +                             READ_ONCE(current_prog_set->programs[i]);
>> >> +                     if (new_prog_set->programs[i])
>> >> +                             refcount_inc(&new_prog_set->programs[i]->usage);
>> >> +             }
>> >> +
>> >> +             /*
>> >> +              * Landlock program set from the current task will not be freed
>> >> +              * here because the usage is strictly greater than 1. It is
>> >> +              * only prevented to be freed by another task thanks to the
>> >> +              * caller of landlock_prepend_prog() which should be locked if
>> >> +              * needed.
>> >> +              */
>> >> +             landlock_put_prog_set(current_prog_set);
>> >> +     }
>> >> +
>> >> +     /* prepend tmp_prog_set to new_prog_set */
>> >> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
>> >> +             /* get the last new list */
>> >> +             struct landlock_prog_list *last_list =
>> >> +                     tmp_prog_set.programs[i];
>> >> +
>> >> +             if (last_list) {
>> >> +                     while (last_list->prev)
>> >> +                             last_list = last_list->prev;
>> >> +                     /* no need to increment usage (pointer replacement) */
>> >> +                     last_list->prev = new_prog_set->programs[i];
>> >> +                     new_prog_set->programs[i] = tmp_prog_set.programs[i];
>> >> +             }
>> >> +     }
>> >> +     new_prog_set->chain_last = tmp_prog_set.chain_last;
>> >> +     return new_prog_set;
>> >> +
>> >> +put_tmp_lists:
>> >> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
>> >> +             put_landlock_prog_list(tmp_prog_set.programs[i]);
>> >> +     return new_prog_set;
>> >> +}
>> >
>> > Nack on the chaining concept.
>> > Please do not reinvent the wheel.
>> > There is an existing mechanism for attaching/detaching/quering multiple
>> > programs attached to cgroup and tracing hooks that are also
>> > efficiently executed via BPF_PROG_RUN_ARRAY.
>> > Please use that instead.
>> >
>>
>> I don't see how that would help.  Suppose you add a filter, then
>> fork(), and then the child adds another filter.  Do you want to
>> duplicate the entire array?  You certainly can't *modify* the array
>> because you'll affect processes that shouldn't be affected.
>>
>> In contrast, doing this through seccomp like the earlier patches
>> seemed just fine to me, and seccomp already had the right logic.
>
> it doesn't look to me that existing seccomp side of managing fork
> situation can be reused. Here there is an attempt to add 'chaining'
> concept which sort of an extension of existing seccomp style,
> but somehow heavily done on bpf side and contradicts cgroup/tracing.
>

I don't see why the seccomp way can't be used.  I agree with you that
the seccomp *style* shouldn't be used in bpf code like this, but I
think that Landlock programs can and should just live in the existing
seccomp chain.  If the existing seccomp code needs some modification
to make this work, then so be it.

In other words, the kernel already has two kinds of chaining:
seccomp's and bpf's.  bpf's doesn't work right for this type of usage
across fork(), whereas seccomp's already handles that case correctly.
(In contrast, seccomp's is totally wrong for cgroup-attached filters.)
 So IMO Landlock should use the seccomp core code and call into bpf
for the actual filtering.

For what it's worth, I haven't figured out what Mickaël means about
chaining restrictions involving cookies.  This seems like something
wrong and the design is too complicated.
Alexei Starovoitov Feb. 27, 2018, 5:32 a.m. UTC | #5
On Tue, Feb 27, 2018 at 05:20:55AM +0000, Andy Lutomirski wrote:
> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Tue, Feb 27, 2018 at 04:40:34AM +0000, Andy Lutomirski wrote:
> >> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >> > On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
> >> >> The seccomp(2) syscall can be used by a task to apply a Landlock program
> >> >> to itself. As a seccomp filter, a Landlock program is enforced for the
> >> >> current task and all its future children. A program is immutable and a
> >> >> task can only add new restricting programs to itself, forming a list of
> >> >> programss.
> >> >>
> >> >> A Landlock program is tied to a Landlock hook. If the action on a kernel
> >> >> object is allowed by the other Linux security mechanisms (e.g. DAC,
> >> >> capabilities, other LSM), then a Landlock hook related to this kind of
> >> >> object is triggered. The list of programs for this hook is then
> >> >> evaluated. Each program return a 32-bit value which can deny the action
> >> >> on a kernel object with a non-zero value. If every programs of the list
> >> >> return zero, then the action on the object is allowed.
> >> >>
> >> >> Multiple Landlock programs can be chained to share a 64-bits value for a
> >> >> call chain (e.g. evaluating multiple elements of a file path).  This
> >> >> chaining is restricted when a process construct this chain by loading a
> >> >> program, but additional checks are performed when it requests to apply
> >> >> this chain of programs to itself.  The restrictions ensure that it is
> >> >> not possible to call multiple programs in a way that would imply to
> >> >> handle multiple shared values (i.e. cookies) for one chain.  For now,
> >> >> only a fs_pick program can be chained to the same type of program,
> >> >> because it may make sense if they have different triggers (cf. next
> >> >> commits).  This restrictions still allows to reuse Landlock programs in
> >> >> a safe way (e.g. use the same loaded fs_walk program with multiple
> >> >> chains of fs_pick programs).
> >> >>
> >> >> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> >> >
> >> > ...
> >> >
> >> >> +struct landlock_prog_set *landlock_prepend_prog(
> >> >> +             struct landlock_prog_set *current_prog_set,
> >> >> +             struct bpf_prog *prog)
> >> >> +{
> >> >> +     struct landlock_prog_set *new_prog_set = current_prog_set;
> >> >> +     unsigned long pages;
> >> >> +     int err;
> >> >> +     size_t i;
> >> >> +     struct landlock_prog_set tmp_prog_set = {};
> >> >> +
> >> >> +     if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
> >> >> +             return ERR_PTR(-EINVAL);
> >> >> +
> >> >> +     /* validate memory size allocation */
> >> >> +     pages = prog->pages;
> >> >> +     if (current_prog_set) {
> >> >> +             size_t i;
> >> >> +
> >> >> +             for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
> >> >> +                     struct landlock_prog_list *walker_p;
> >> >> +
> >> >> +                     for (walker_p = current_prog_set->programs[i];
> >> >> +                                     walker_p; walker_p = walker_p->prev)
> >> >> +                             pages += walker_p->prog->pages;
> >> >> +             }
> >> >> +             /* count a struct landlock_prog_set if we need to allocate one */
> >> >> +             if (refcount_read(&current_prog_set->usage) != 1)
> >> >> +                     pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
> >> >> +                             / PAGE_SIZE;
> >> >> +     }
> >> >> +     if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
> >> >> +             return ERR_PTR(-E2BIG);
> >> >> +
> >> >> +     /* ensure early that we can allocate enough memory for the new
> >> >> +      * prog_lists */
> >> >> +     err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
> >> >> +     if (err)
> >> >> +             return ERR_PTR(err);
> >> >> +
> >> >> +     /*
> >> >> +      * Each task_struct points to an array of prog list pointers.  These
> >> >> +      * tables are duplicated when additions are made (which means each
> >> >> +      * table needs to be refcounted for the processes using it). When a new
> >> >> +      * table is created, all the refcounters on the prog_list are bumped (to
> >> >> +      * track each table that references the prog). When a new prog is
> >> >> +      * added, it's just prepended to the list for the new table to point
> >> >> +      * at.
> >> >> +      *
> >> >> +      * Manage all the possible errors before this step to not uselessly
> >> >> +      * duplicate current_prog_set and avoid a rollback.
> >> >> +      */
> >> >> +     if (!new_prog_set) {
> >> >> +             /*
> >> >> +              * If there is no Landlock program set used by the current task,
> >> >> +              * then create a new one.
> >> >> +              */
> >> >> +             new_prog_set = new_landlock_prog_set();
> >> >> +             if (IS_ERR(new_prog_set))
> >> >> +                     goto put_tmp_lists;
> >> >> +     } else if (refcount_read(&current_prog_set->usage) > 1) {
> >> >> +             /*
> >> >> +              * If the current task is not the sole user of its Landlock
> >> >> +              * program set, then duplicate them.
> >> >> +              */
> >> >> +             new_prog_set = new_landlock_prog_set();
> >> >> +             if (IS_ERR(new_prog_set))
> >> >> +                     goto put_tmp_lists;
> >> >> +             for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
> >> >> +                     new_prog_set->programs[i] =
> >> >> +                             READ_ONCE(current_prog_set->programs[i]);
> >> >> +                     if (new_prog_set->programs[i])
> >> >> +                             refcount_inc(&new_prog_set->programs[i]->usage);
> >> >> +             }
> >> >> +
> >> >> +             /*
> >> >> +              * Landlock program set from the current task will not be freed
> >> >> +              * here because the usage is strictly greater than 1. It is
> >> >> +              * only prevented to be freed by another task thanks to the
> >> >> +              * caller of landlock_prepend_prog() which should be locked if
> >> >> +              * needed.
> >> >> +              */
> >> >> +             landlock_put_prog_set(current_prog_set);
> >> >> +     }
> >> >> +
> >> >> +     /* prepend tmp_prog_set to new_prog_set */
> >> >> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
> >> >> +             /* get the last new list */
> >> >> +             struct landlock_prog_list *last_list =
> >> >> +                     tmp_prog_set.programs[i];
> >> >> +
> >> >> +             if (last_list) {
> >> >> +                     while (last_list->prev)
> >> >> +                             last_list = last_list->prev;
> >> >> +                     /* no need to increment usage (pointer replacement) */
> >> >> +                     last_list->prev = new_prog_set->programs[i];
> >> >> +                     new_prog_set->programs[i] = tmp_prog_set.programs[i];
> >> >> +             }
> >> >> +     }
> >> >> +     new_prog_set->chain_last = tmp_prog_set.chain_last;
> >> >> +     return new_prog_set;
> >> >> +
> >> >> +put_tmp_lists:
> >> >> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
> >> >> +             put_landlock_prog_list(tmp_prog_set.programs[i]);
> >> >> +     return new_prog_set;
> >> >> +}
> >> >
> >> > Nack on the chaining concept.
> >> > Please do not reinvent the wheel.
> >> > There is an existing mechanism for attaching/detaching/quering multiple
> >> > programs attached to cgroup and tracing hooks that are also
> >> > efficiently executed via BPF_PROG_RUN_ARRAY.
> >> > Please use that instead.
> >> >
> >>
> >> I don't see how that would help.  Suppose you add a filter, then
> >> fork(), and then the child adds another filter.  Do you want to
> >> duplicate the entire array?  You certainly can't *modify* the array
> >> because you'll affect processes that shouldn't be affected.
> >>
> >> In contrast, doing this through seccomp like the earlier patches
> >> seemed just fine to me, and seccomp already had the right logic.
> >
> > it doesn't look to me that existing seccomp side of managing fork
> > situation can be reused. Here there is an attempt to add 'chaining'
> > concept which sort of an extension of existing seccomp style,
> > but somehow heavily done on bpf side and contradicts cgroup/tracing.
> >
> 
> I don't see why the seccomp way can't be used.  I agree with you that
> the seccomp *style* shouldn't be used in bpf code like this, but I
> think that Landlock programs can and should just live in the existing
> seccomp chain.  If the existing seccomp code needs some modification
> to make this work, then so be it.

+1
if that was the case...
but that's not my reading of the patch set.

> In other words, the kernel already has two kinds of chaining:
> seccomp's and bpf's.  bpf's doesn't work right for this type of usage
> across fork(), whereas seccomp's already handles that case correctly.
> (In contrast, seccomp's is totally wrong for cgroup-attached filters.)
>  So IMO Landlock should use the seccomp core code and call into bpf
> for the actual filtering.

+1
in cgroup we had to invent this new BPF_PROG_RUN_ARRAY mechanism,
since cgroup hierarchy can be complicated with bpf progs attached
at different levels with different override/multiprog properties,
so walking a linked list and checking all flags at run-time would have
been too slow. That's why we added compute_effective_progs().

imo doing similar stuff at fork is not a big deal either.
allocating a new bpf_prog_array for the task and computing
the array of progs is trivial.

> For what it's worth, I haven't figured out that Mickaël means about
> chaining restrictions involving cookies.  This seems like something
> wrong and the design is too complicated.

the cookie part I don't like either.
In networking land we have cb in skb and xdp metadata to pass data
between programs.
imo right now for landlock there is no need to do any of this stuff.
It's purely a feature extension that is clearly controversial and
blocking even basic review of the rest of the patches.
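
For readers unfamiliar with compute_effective_progs(): the idea is to
rebuild a flat "effective" array whenever the program set changes, so
that run time only walks a plain array. Below is a standalone sketch of
doing something similar at attach/fork time; it is not kernel code, and
all names are invented for illustration.

/* Standalone sketch of computing a flat "effective" program array from a
 * parent's array plus one newly attached program, in the spirit of
 * compute_effective_progs() for cgroup-bpf. */
#include <stdlib.h>
#include <string.h>

struct prog;			/* opaque program handle */

struct prog_array {
	size_t cnt;
	const struct prog **progs;
};

static struct prog_array *compute_effective(const struct prog_array *parent,
					     const struct prog *new_prog)
{
	struct prog_array *eff = calloc(1, sizeof(*eff));
	size_t parent_cnt = parent ? parent->cnt : 0;

	if (!eff)
		return NULL;
	eff->cnt = parent_cnt + 1;
	eff->progs = calloc(eff->cnt, sizeof(*eff->progs));
	if (!eff->progs) {
		free(eff);
		return NULL;
	}
	/* newest program first, then the inherited ones, so the run order
	 * matches the prepend semantics described in the commit message */
	eff->progs[0] = new_prog;
	if (parent_cnt)
		memcpy(&eff->progs[1], parent->progs,
		       parent_cnt * sizeof(*eff->progs));
	return eff;
}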
Andy Lutomirski Feb. 27, 2018, 4:39 p.m. UTC | #6
On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Feb 27, 2018 at 05:20:55AM +0000, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Tue, Feb 27, 2018 at 04:40:34AM +0000, Andy Lutomirski wrote:
>> >> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>> >> <alexei.starovoitov@gmail.com> wrote:
>> >> > On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>> >> >> The seccomp(2) syscall can be used by a task to apply a Landlock program
>> >> >> to itself. As a seccomp filter, a Landlock program is enforced for the
>> >> >> current task and all its future children. A program is immutable and a
>> >> >> task can only add new restricting programs to itself, forming a list of
>> >> >> programss.
>> >> >>
>> >> >> A Landlock program is tied to a Landlock hook. If the action on a kernel
>> >> >> object is allowed by the other Linux security mechanisms (e.g. DAC,
>> >> >> capabilities, other LSM), then a Landlock hook related to this kind of
>> >> >> object is triggered. The list of programs for this hook is then
>> >> >> evaluated. Each program return a 32-bit value which can deny the action
>> >> >> on a kernel object with a non-zero value. If every programs of the list
>> >> >> return zero, then the action on the object is allowed.
>> >> >>
>> >> >> Multiple Landlock programs can be chained to share a 64-bits value for a
>> >> >> call chain (e.g. evaluating multiple elements of a file path).  This
>> >> >> chaining is restricted when a process construct this chain by loading a
>> >> >> program, but additional checks are performed when it requests to apply
>> >> >> this chain of programs to itself.  The restrictions ensure that it is
>> >> >> not possible to call multiple programs in a way that would imply to
>> >> >> handle multiple shared values (i.e. cookies) for one chain.  For now,
>> >> >> only a fs_pick program can be chained to the same type of program,
>> >> >> because it may make sense if they have different triggers (cf. next
>> >> >> commits).  This restrictions still allows to reuse Landlock programs in
>> >> >> a safe way (e.g. use the same loaded fs_walk program with multiple
>> >> >> chains of fs_pick programs).
>> >> >>
>> >> >> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>> >> >
>> >> > ...
>> >> >
>> >> >> +struct landlock_prog_set *landlock_prepend_prog(
>> >> >> +             struct landlock_prog_set *current_prog_set,
>> >> >> +             struct bpf_prog *prog)
>> >> >> +{
>> >> >> +     struct landlock_prog_set *new_prog_set = current_prog_set;
>> >> >> +     unsigned long pages;
>> >> >> +     int err;
>> >> >> +     size_t i;
>> >> >> +     struct landlock_prog_set tmp_prog_set = {};
>> >> >> +
>> >> >> +     if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>> >> >> +             return ERR_PTR(-EINVAL);
>> >> >> +
>> >> >> +     /* validate memory size allocation */
>> >> >> +     pages = prog->pages;
>> >> >> +     if (current_prog_set) {
>> >> >> +             size_t i;
>> >> >> +
>> >> >> +             for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
>> >> >> +                     struct landlock_prog_list *walker_p;
>> >> >> +
>> >> >> +                     for (walker_p = current_prog_set->programs[i];
>> >> >> +                                     walker_p; walker_p = walker_p->prev)
>> >> >> +                             pages += walker_p->prog->pages;
>> >> >> +             }
>> >> >> +             /* count a struct landlock_prog_set if we need to allocate one */
>> >> >> +             if (refcount_read(&current_prog_set->usage) != 1)
>> >> >> +                     pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
>> >> >> +                             / PAGE_SIZE;
>> >> >> +     }
>> >> >> +     if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
>> >> >> +             return ERR_PTR(-E2BIG);
>> >> >> +
>> >> >> +     /* ensure early that we can allocate enough memory for the new
>> >> >> +      * prog_lists */
>> >> >> +     err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
>> >> >> +     if (err)
>> >> >> +             return ERR_PTR(err);
>> >> >> +
>> >> >> +     /*
>> >> >> +      * Each task_struct points to an array of prog list pointers.  These
>> >> >> +      * tables are duplicated when additions are made (which means each
>> >> >> +      * table needs to be refcounted for the processes using it). When a new
>> >> >> +      * table is created, all the refcounters on the prog_list are bumped (to
>> >> >> +      * track each table that references the prog). When a new prog is
>> >> >> +      * added, it's just prepended to the list for the new table to point
>> >> >> +      * at.
>> >> >> +      *
>> >> >> +      * Manage all the possible errors before this step to not uselessly
>> >> >> +      * duplicate current_prog_set and avoid a rollback.
>> >> >> +      */
>> >> >> +     if (!new_prog_set) {
>> >> >> +             /*
>> >> >> +              * If there is no Landlock program set used by the current task,
>> >> >> +              * then create a new one.
>> >> >> +              */
>> >> >> +             new_prog_set = new_landlock_prog_set();
>> >> >> +             if (IS_ERR(new_prog_set))
>> >> >> +                     goto put_tmp_lists;
>> >> >> +     } else if (refcount_read(&current_prog_set->usage) > 1) {
>> >> >> +             /*
>> >> >> +              * If the current task is not the sole user of its Landlock
>> >> >> +              * program set, then duplicate them.
>> >> >> +              */
>> >> >> +             new_prog_set = new_landlock_prog_set();
>> >> >> +             if (IS_ERR(new_prog_set))
>> >> >> +                     goto put_tmp_lists;
>> >> >> +             for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
>> >> >> +                     new_prog_set->programs[i] =
>> >> >> +                             READ_ONCE(current_prog_set->programs[i]);
>> >> >> +                     if (new_prog_set->programs[i])
>> >> >> +                             refcount_inc(&new_prog_set->programs[i]->usage);
>> >> >> +             }
>> >> >> +
>> >> >> +             /*
>> >> >> +              * Landlock program set from the current task will not be freed
>> >> >> +              * here because the usage is strictly greater than 1. It is
>> >> >> +              * only prevented to be freed by another task thanks to the
>> >> >> +              * caller of landlock_prepend_prog() which should be locked if
>> >> >> +              * needed.
>> >> >> +              */
>> >> >> +             landlock_put_prog_set(current_prog_set);
>> >> >> +     }
>> >> >> +
>> >> >> +     /* prepend tmp_prog_set to new_prog_set */
>> >> >> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
>> >> >> +             /* get the last new list */
>> >> >> +             struct landlock_prog_list *last_list =
>> >> >> +                     tmp_prog_set.programs[i];
>> >> >> +
>> >> >> +             if (last_list) {
>> >> >> +                     while (last_list->prev)
>> >> >> +                             last_list = last_list->prev;
>> >> >> +                     /* no need to increment usage (pointer replacement) */
>> >> >> +                     last_list->prev = new_prog_set->programs[i];
>> >> >> +                     new_prog_set->programs[i] = tmp_prog_set.programs[i];
>> >> >> +             }
>> >> >> +     }
>> >> >> +     new_prog_set->chain_last = tmp_prog_set.chain_last;
>> >> >> +     return new_prog_set;
>> >> >> +
>> >> >> +put_tmp_lists:
>> >> >> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
>> >> >> +             put_landlock_prog_list(tmp_prog_set.programs[i]);
>> >> >> +     return new_prog_set;
>> >> >> +}
>> >> >
>> >> > Nack on the chaining concept.
>> >> > Please do not reinvent the wheel.
>> >> > There is an existing mechanism for attaching/detaching/quering multiple
>> >> > programs attached to cgroup and tracing hooks that are also
>> >> > efficiently executed via BPF_PROG_RUN_ARRAY.
>> >> > Please use that instead.
>> >> >
>> >>
>> >> I don't see how that would help.  Suppose you add a filter, then
>> >> fork(), and then the child adds another filter.  Do you want to
>> >> duplicate the entire array?  You certainly can't *modify* the array
>> >> because you'll affect processes that shouldn't be affected.
>> >>
>> >> In contrast, doing this through seccomp like the earlier patches
>> >> seemed just fine to me, and seccomp already had the right logic.
>> >
>> > it doesn't look to me that existing seccomp side of managing fork
>> > situation can be reused. Here there is an attempt to add 'chaining'
>> > concept which sort of an extension of existing seccomp style,
>> > but somehow heavily done on bpf side and contradicts cgroup/tracing.
>> >
>>
>> I don't see why the seccomp way can't be used.  I agree with you that
>> the seccomp *style* shouldn't be used in bpf code like this, but I
>> think that Landlock programs can and should just live in the existing
>> seccomp chain.  If the existing seccomp code needs some modification
>> to make this work, then so be it.
>
> +1
> if that was the case...
> but that's not my reading of the patch set.

An earlier version of the patch set used the seccomp filter chain.
Mickaël, what exactly was wrong with that approach other than that the
seccomp() syscall was awkward for you to use?  You could add a
seccomp_add_landlock_rule() syscall if you needed to.

As a side comment, why is this an LSM at all, let alone a non-stacking
LSM?  It would make a lot more sense to me to make Landlock depend on
having LSMs configured in but to call the landlock hooks directly from
the security_xyz() hooks.

>
>> In other words, the kernel already has two kinds of chaining:
>> seccomp's and bpf's.  bpf's doesn't work right for this type of usage
>> across fork(), whereas seccomp's already handles that case correctly.
>> (In contrast, seccomp's is totally wrong for cgroup-attached filters.)
>>  So IMO Landlock should use the seccomp core code and call into bpf
>> for the actual filtering.
>
> +1
> in cgroup we had to invent this new BPF_PROG_RUN_ARRAY mechanism,
> since cgroup hierarchy can be complicated with bpf progs attached
> at different levels with different override/multiprog properties,
> so walking link list and checking all flags at run-time would have
> been too slow. That's why we added compute_effective_progs().

If we start adding override flags to Landlock, I think we're doing it
wrong.   With cgroup bpf programs, the whole mess is set up by the
administrator.  With seccomp, and with Landlock if done correctly, it
*won't* be set up by the administrator, so the chance that everyone
gets all the flags right is about zero.  All attached filters should
run unconditionally.
Casey Schaufler Feb. 27, 2018, 5:30 p.m. UTC | #7
On 2/27/2018 8:39 AM, Andy Lutomirski wrote:
> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> [ Snip ]
> An earlier version of the patch set used the seccomp filter chain.
> Mickaël, what exactly was wrong with that approach other than that the
> seccomp() syscall was awkward for you to use?  You could add a
> seccomp_add_landlock_rule() syscall if you needed to.
>
> As a side comment, why is this an LSM at all, let alone a non-stacking
> LSM?  It would make a lot more sense to me to make Landlock depend on
> having LSMs configured in but to call the landlock hooks directly from
> the security_xyz() hooks.

Please, no. It is my serious intention to have at least the
infrastructure blob management in within a release or two, and
I think that's all Landlock needs. The security_xyz() hooks are
sufficiently hackish as it is without unnecessarily adding more
special cases.
Andy Lutomirski Feb. 27, 2018, 5:36 p.m. UTC | #8
On Tue, Feb 27, 2018 at 5:30 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 2/27/2018 8:39 AM, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>> [ Snip ]
>> An earlier version of the patch set used the seccomp filter chain.
>> Mickaël, what exactly was wrong with that approach other than that the
>> seccomp() syscall was awkward for you to use?  You could add a
>> seccomp_add_landlock_rule() syscall if you needed to.
>>
>> As a side comment, why is this an LSM at all, let alone a non-stacking
>> LSM?  It would make a lot more sense to me to make Landlock depend on
>> having LSMs configured in but to call the landlock hooks directly from
>> the security_xyz() hooks.
>
> Please, no. It is my serious intention to have at least the
> infrastructure blob management in within a release or two, and
> I think that's all Landlock needs. The security_xyz() hooks are
> sufficiently hackish as it is without unnecessarily adding more
> special cases.
>
>

What do you mean by "infrastructure blob management"?
Casey Schaufler Feb. 27, 2018, 6:03 p.m. UTC | #9
On 2/27/2018 9:36 AM, Andy Lutomirski wrote:
> On Tue, Feb 27, 2018 at 5:30 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 2/27/2018 8:39 AM, Andy Lutomirski wrote:
>>> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>> [ Snip ]
>>> An earlier version of the patch set used the seccomp filter chain.
>>> Mickaël, what exactly was wrong with that approach other than that the
>>> seccomp() syscall was awkward for you to use?  You could add a
>>> seccomp_add_landlock_rule() syscall if you needed to.
>>>
>>> As a side comment, why is this an LSM at all, let alone a non-stacking
>>> LSM?  It would make a lot more sense to me to make Landlock depend on
>>> having LSMs configured in but to call the landlock hooks directly from
>>> the security_xyz() hooks.
>> Please, no. It is my serious intention to have at least the
>> infrastructure blob management in within a release or two, and
>> I think that's all Landlock needs. The security_xyz() hooks are
>> sufficiently hackish as it is without unnecessarily adding more
>> special cases.
>>
>>
> What do you mean by "infrastructure blob management"?

Today each security module manages their own module specific data,
for example inode->i_security and file->f_security. This prevents
having two security modules that have inode or file data from being
used at the same time, because they both need to manage those fields.
Moving the management of the module specific data (aka "blobs") from
the security modules to the module infrastructure will allow those
modules to coexist. Restrictions apply, of course, but I don't think
that Landlock uses any of the facilities that would have issues.
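
As a rough illustration of the idea (the names below are made up and do
not come from the actual stacking patches): each module declares how
much blob space it needs, the infrastructure does a single allocation
per object, and every module only touches its own slice at a fixed
offset:

/* sketch only: hypothetical names, not the real LSM infrastructure API */
struct my_lsm_inode { u32 flags; };     /* this module's per-inode data */

struct lsm_blob_sizes {
        int lbs_inode;  /* bytes this module needs in inode->i_security */
        int lbs_file;   /* bytes this module needs in file->f_security */
};

static int total_inode_blob_size;       /* sum of every module's lbs_inode */
static int my_lsm_inode_offset;         /* offset handed out at init time */

/* the infrastructure allocates one blob per inode for all modules */
static int lsm_inode_alloc(struct inode *inode)
{
        inode->i_security = kzalloc(total_inode_blob_size, GFP_KERNEL);
        return inode->i_security ? 0 : -ENOMEM;
}

/* each module only dereferences its own slice of the shared blob */
static inline struct my_lsm_inode *my_lsm_inode(const struct inode *inode)
{
        return inode->i_security + my_lsm_inode_offset;
}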
Mickaël Salaün Feb. 27, 2018, 9:48 p.m. UTC | #10
On 27/02/2018 17:39, Andy Lutomirski wrote:
> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Tue, Feb 27, 2018 at 05:20:55AM +0000, Andy Lutomirski wrote:
>>> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>> On Tue, Feb 27, 2018 at 04:40:34AM +0000, Andy Lutomirski wrote:
>>>>> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>> On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>>>>>>> The seccomp(2) syscall can be used by a task to apply a Landlock program
>>>>>>> to itself. As a seccomp filter, a Landlock program is enforced for the
>>>>>>> current task and all its future children. A program is immutable and a
>>>>>>> task can only add new restricting programs to itself, forming a list of
>>>>>>> programss.
>>>>>>>
>>>>>>> A Landlock program is tied to a Landlock hook. If the action on a kernel
>>>>>>> object is allowed by the other Linux security mechanisms (e.g. DAC,
>>>>>>> capabilities, other LSM), then a Landlock hook related to this kind of
>>>>>>> object is triggered. The list of programs for this hook is then
>>>>>>> evaluated. Each program return a 32-bit value which can deny the action
>>>>>>> on a kernel object with a non-zero value. If every programs of the list
>>>>>>> return zero, then the action on the object is allowed.
>>>>>>>
>>>>>>> Multiple Landlock programs can be chained to share a 64-bits value for a
>>>>>>> call chain (e.g. evaluating multiple elements of a file path).  This
>>>>>>> chaining is restricted when a process construct this chain by loading a
>>>>>>> program, but additional checks are performed when it requests to apply
>>>>>>> this chain of programs to itself.  The restrictions ensure that it is
>>>>>>> not possible to call multiple programs in a way that would imply to
>>>>>>> handle multiple shared values (i.e. cookies) for one chain.  For now,
>>>>>>> only a fs_pick program can be chained to the same type of program,
>>>>>>> because it may make sense if they have different triggers (cf. next
>>>>>>> commits).  This restrictions still allows to reuse Landlock programs in
>>>>>>> a safe way (e.g. use the same loaded fs_walk program with multiple
>>>>>>> chains of fs_pick programs).
>>>>>>>
>>>>>>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>> +struct landlock_prog_set *landlock_prepend_prog(
>>>>>>> +             struct landlock_prog_set *current_prog_set,
>>>>>>> +             struct bpf_prog *prog)
>>>>>>> +{
>>>>>>> +     struct landlock_prog_set *new_prog_set = current_prog_set;
>>>>>>> +     unsigned long pages;
>>>>>>> +     int err;
>>>>>>> +     size_t i;
>>>>>>> +     struct landlock_prog_set tmp_prog_set = {};
>>>>>>> +
>>>>>>> +     if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>>>>>>> +             return ERR_PTR(-EINVAL);
>>>>>>> +
>>>>>>> +     /* validate memory size allocation */
>>>>>>> +     pages = prog->pages;
>>>>>>> +     if (current_prog_set) {
>>>>>>> +             size_t i;
>>>>>>> +
>>>>>>> +             for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
>>>>>>> +                     struct landlock_prog_list *walker_p;
>>>>>>> +
>>>>>>> +                     for (walker_p = current_prog_set->programs[i];
>>>>>>> +                                     walker_p; walker_p = walker_p->prev)
>>>>>>> +                             pages += walker_p->prog->pages;
>>>>>>> +             }
>>>>>>> +             /* count a struct landlock_prog_set if we need to allocate one */
>>>>>>> +             if (refcount_read(&current_prog_set->usage) != 1)
>>>>>>> +                     pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
>>>>>>> +                             / PAGE_SIZE;
>>>>>>> +     }
>>>>>>> +     if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
>>>>>>> +             return ERR_PTR(-E2BIG);
>>>>>>> +
>>>>>>> +     /* ensure early that we can allocate enough memory for the new
>>>>>>> +      * prog_lists */
>>>>>>> +     err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
>>>>>>> +     if (err)
>>>>>>> +             return ERR_PTR(err);
>>>>>>> +
>>>>>>> +     /*
>>>>>>> +      * Each task_struct points to an array of prog list pointers.  These
>>>>>>> +      * tables are duplicated when additions are made (which means each
>>>>>>> +      * table needs to be refcounted for the processes using it). When a new
>>>>>>> +      * table is created, all the refcounters on the prog_list are bumped (to
>>>>>>> +      * track each table that references the prog). When a new prog is
>>>>>>> +      * added, it's just prepended to the list for the new table to point
>>>>>>> +      * at.
>>>>>>> +      *
>>>>>>> +      * Manage all the possible errors before this step to not uselessly
>>>>>>> +      * duplicate current_prog_set and avoid a rollback.
>>>>>>> +      */
>>>>>>> +     if (!new_prog_set) {
>>>>>>> +             /*
>>>>>>> +              * If there is no Landlock program set used by the current task,
>>>>>>> +              * then create a new one.
>>>>>>> +              */
>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>> +                     goto put_tmp_lists;
>>>>>>> +     } else if (refcount_read(&current_prog_set->usage) > 1) {
>>>>>>> +             /*
>>>>>>> +              * If the current task is not the sole user of its Landlock
>>>>>>> +              * program set, then duplicate them.
>>>>>>> +              */
>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>> +                     goto put_tmp_lists;
>>>>>>> +             for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
>>>>>>> +                     new_prog_set->programs[i] =
>>>>>>> +                             READ_ONCE(current_prog_set->programs[i]);
>>>>>>> +                     if (new_prog_set->programs[i])
>>>>>>> +                             refcount_inc(&new_prog_set->programs[i]->usage);
>>>>>>> +             }
>>>>>>> +
>>>>>>> +             /*
>>>>>>> +              * Landlock program set from the current task will not be freed
>>>>>>> +              * here because the usage is strictly greater than 1. It is
>>>>>>> +              * only prevented to be freed by another task thanks to the
>>>>>>> +              * caller of landlock_prepend_prog() which should be locked if
>>>>>>> +              * needed.
>>>>>>> +              */
>>>>>>> +             landlock_put_prog_set(current_prog_set);
>>>>>>> +     }
>>>>>>> +
>>>>>>> +     /* prepend tmp_prog_set to new_prog_set */
>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
>>>>>>> +             /* get the last new list */
>>>>>>> +             struct landlock_prog_list *last_list =
>>>>>>> +                     tmp_prog_set.programs[i];
>>>>>>> +
>>>>>>> +             if (last_list) {
>>>>>>> +                     while (last_list->prev)
>>>>>>> +                             last_list = last_list->prev;
>>>>>>> +                     /* no need to increment usage (pointer replacement) */
>>>>>>> +                     last_list->prev = new_prog_set->programs[i];
>>>>>>> +                     new_prog_set->programs[i] = tmp_prog_set.programs[i];
>>>>>>> +             }
>>>>>>> +     }
>>>>>>> +     new_prog_set->chain_last = tmp_prog_set.chain_last;
>>>>>>> +     return new_prog_set;
>>>>>>> +
>>>>>>> +put_tmp_lists:
>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
>>>>>>> +             put_landlock_prog_list(tmp_prog_set.programs[i]);
>>>>>>> +     return new_prog_set;
>>>>>>> +}
>>>>>>
>>>>>> Nack on the chaining concept.
>>>>>> Please do not reinvent the wheel.
>>>>>> There is an existing mechanism for attaching/detaching/quering multiple
>>>>>> programs attached to cgroup and tracing hooks that are also
>>>>>> efficiently executed via BPF_PROG_RUN_ARRAY.
>>>>>> Please use that instead.
>>>>>>
>>>>>
>>>>> I don't see how that would help.  Suppose you add a filter, then
>>>>> fork(), and then the child adds another filter.  Do you want to
>>>>> duplicate the entire array?  You certainly can't *modify* the array
>>>>> because you'll affect processes that shouldn't be affected.
>>>>>
>>>>> In contrast, doing this through seccomp like the earlier patches
>>>>> seemed just fine to me, and seccomp already had the right logic.
>>>>
>>>> it doesn't look to me that existing seccomp side of managing fork
>>>> situation can be reused. Here there is an attempt to add 'chaining'
>>>> concept which sort of an extension of existing seccomp style,
>>>> but somehow heavily done on bpf side and contradicts cgroup/tracing.
>>>>
>>>
>>> I don't see why the seccomp way can't be used.  I agree with you that
>>> the seccomp *style* shouldn't be used in bpf code like this, but I
>>> think that Landlock programs can and should just live in the existing
>>> seccomp chain.  If the existing seccomp code needs some modification
>>> to make this work, then so be it.
>>
>> +1
>> if that was the case...
>> but that's not my reading of the patch set.
> 
> An earlier version of the patch set used the seccomp filter chain.
> Mickaël, what exactly was wrong with that approach other than that the
> seccomp() syscall was awkward for you to use?  You could add a
> seccomp_add_landlock_rule() syscall if you needed to.

Nothing was wrong with that, and this part did not change (see my
next comment).

> 
> As a side comment, why is this an LSM at all, let alone a non-stacking
> LSM?  It would make a lot more sense to me to make Landlock depend on
> having LSMs configured in but to call the landlock hooks directly from
> the security_xyz() hooks.

See Casey's answer and his patch series: https://lwn.net/Articles/741963/

> 
>>
>>> In other words, the kernel already has two kinds of chaining:
>>> seccomp's and bpf's.  bpf's doesn't work right for this type of usage
>>> across fork(), whereas seccomp's already handles that case correctly.
>>> (In contrast, seccomp's is totally wrong for cgroup-attached filters.)
>>>  So IMO Landlock should use the seccomp core code and call into bpf
>>> for the actual filtering.
>>
>> +1
>> in cgroup we had to invent this new BPF_PROG_RUN_ARRAY mechanism,
>> since cgroup hierarchy can be complicated with bpf progs attached
>> at different levels with different override/multiprog properties,
>> so walking link list and checking all flags at run-time would have
>> been too slow. That's why we added compute_effective_progs().
> 
> If we start adding override flags to Landlock, I think we're doing it
> wrong.   With cgroup bpf programs, the whole mess is set up by the
> administrator.  With seccomp, and with Landlock if done correctly, it
> *won't* be set up by the administrator, so the chance that everyone
> gets all the flags right is about zero.  All attached filters should
> run unconditionally.


There is a misunderstanding about this chaining mechanism. It should
not be confused with the list of seccomp filters nor with the cgroup
hierarchies. Landlock programs can be stacked the same way seccomp
filters can (cf. struct landlock_prog_set; the "chain_last" field is
only an optimization and plays no role in how this struct is handled).
This stackable property has not changed since the previous patch
series. The chaining mechanism is for another use case, one which does
not make sense for seccomp filters nor for other eBPF program types, at
least for now, from what I can tell.

You may want to take a look at my talk at FOSDEM
(https://landlock.io/talks/2018-02-04_landlock-fosdem.pdf), especially
slides 11 and 12.

Let me explain my reasoning about this program chaining thing.

To check whether an action on a file is allowed, we first need to
identify this file and match it against the security policy. In a
previous (non-public) patch series, I tried to use one type of eBPF
program to check every kind of access to a file. To be able to identify
a file, I relied on an eBPF map, similar to the current inode map. This
map stores a set of references to file descriptors. I then created a
function bpf_is_file_beneath() to check whether the requested file was
beneath a file in the map. This way, no chaining was needed: only one
eBPF program type to check an access to a file... but some issues then
emerged. First, this design creates a side channel which helps an
attacker using such a program to infer information not normally
available, for example to get a hint on where a file descriptor
(received from a UNIX socket) comes from. Another issue is that this
type of program would be called for each component of a path. Indeed,
when the kernel checks whether an access to a file is allowed, it walks
through all of the directories in its path (checking whether the
current process is allowed to execute them). That first attempt led me
to rethink the way we could filter an access to a file *path*.

To minimize the number of calls to an eBPF program dedicated to
validating an access to a file path, I decided to create three subtypes
of eBPF programs. The FS_WALK type is called when walking through every
directory of a file path (except the last one if it is the target). We
can then restrict this type of program to the minimum set of functions
it is allowed to call and the minimum set of data available from its
context. The first, implicit, chaining is for this type of program. To
be able to evaluate a path while being called for each of its
components, this program needs to store a state (to remember what the
parent directory of this path was). There is no "previous" field in the
subtype for this program because it is chained with itself, once per
directory. This makes it possible to create a FS_WALK program that
evaluates a file hierarchy, thanks to the inode map, which can be used
to check whether a directory of this hierarchy is part of an allowed
(or denied) list of directories. This design makes it possible to
express a file hierarchy in a programmatic way, without requiring a
dedicated eBPF helper to do the job (unlike my first experiment).
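
To make this more concrete, here is a rough sketch of what such a
fs_walk program could look like. Every name below (section name, map
type, context fields, helper) is hypothetical and only illustrates the
idea of carrying a per-walk state in the shared cookie:

/* sketch only: hypothetical Landlock eBPF API */
#include "bpf_helpers.h"

#define COOKIE_BENEATH_ALLOWED  (1ULL << 0)

/* inode map filled from user space with the allowed directories
 * (key/value sizes omitted for brevity) */
struct bpf_map_def SEC("maps") allowed_dirs = {
        .type = BPF_MAP_TYPE_INODE,     /* hypothetical map type name */
        .max_entries = 16,
};

SEC("landlock/fs_walk")
int fs_walk(struct landlock_ctx_fs_walk *ctx)
{
        /* called once per directory component of the walked path */
        if (bpf_inode_map_lookup(&allowed_dirs, ctx->inode))
                ctx->cookie |= COOKIE_BENEATH_ALLOWED;
        return 0;       /* 0 == allow; the decision belongs to fs_pick */
}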

The explicit chaining is used to tie a path evaluation (with a FS_WALK
program) to an access to the actual file being requested (the last
component of a file path), with a FS_PICK program. It is only at this
point that the kernel checks for the requested action (e.g. read,
write, chdir, append...). To be able to filter such access requests we
could have one call to the same program for every action and let this
program check which action it was called for. However, this design does
not let the kernel know whether the current action is indeed handled by
this program. Hence, it is not possible to implement a cache mechanism
that only calls this program if it knows how to handle this action.

The approach I took for this FS_PICK type of program is to declare in
its subtype which actions it can handle (with the "triggers" bitfield,
seen as ORed actions). This way, the kernel knows whether a call to a
FS_PICK program is necessary. If the user wants to enforce a different
security policy according to the action requested on a file, then
multiple FS_PICK programs are needed. However, to reduce the number of
such programs, this patch series allows a FS_PICK program to be chained
with another one, the same way a FS_WALK is chained with itself. This
way, if the user wants to check that the action is, for example, an
"open" and a "read" and not a "map" and a "read", then multiple FS_PICK
programs with different trigger actions can be chained. The OR check
performed by the kernel is then not a limitation, only a way to know
whether a call to an eBPF program is needed.
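
Again as a hypothetical sketch (the trigger constants, context fields
and the way the "previous" link is expressed are illustrative), two
chained fs_pick programs could cooperate through the shared cookie like
this:

/* sketch only: hypothetical Landlock eBPF API */
#include "bpf_helpers.h"

#define COOKIE_BENEATH_ALLOWED  (1ULL << 0)
#define COOKIE_OPEN_CHECKED     (1ULL << 1)

/* subtype at load time: triggers = FS_PICK_OPEN (hypothetical constant) */
SEC("landlock/fs_pick")
int fs_pick_open(struct landlock_ctx_fs_pick *ctx)
{
        /* deny the open if the walk did not stay beneath an allowed dir */
        if (!(ctx->cookie & COOKIE_BENEATH_ALLOWED))
                return 1;       /* non-zero == deny */
        ctx->cookie |= COOKIE_OPEN_CHECKED;
        return 0;
}

/* subtype: triggers = FS_PICK_READ, chained after fs_pick_open through
 * its "previous" program reference */
SEC("landlock/fs_pick")
int fs_pick_read(struct landlock_ctx_fs_pick *ctx)
{
        /* only allow a read that went through the open check above */
        if (!(ctx->cookie & COOKIE_OPEN_CHECKED))
                return 1;
        return 0;
}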

The last type of program is FS_GET. This one is called when a process
gets a struct file or changes its working directory. This is the only
program type able (and allowed) to tag a file. This restriction is
important to avoid resource exhaustion attacks (i.e. tagging every
inode accessible to an attacker, which would allocate too much kernel
memory).
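
A matching fs_get sketch (the tagging helper below is invented for the
example) could then mark the obtained struct file once the previous
stages approved it:

/* sketch only: hypothetical Landlock eBPF API */
#include "bpf_helpers.h"

#define COOKIE_OPEN_CHECKED     (1ULL << 1)

SEC("landlock/fs_get")
int fs_get(struct landlock_ctx_fs_get *ctx)
{
        /* tag the new struct file so later checks can be cheaper; only
         * this program type may tag, which bounds kernel memory use */
        if (ctx->cookie & COOKIE_OPEN_CHECKED)
                bpf_landlock_tag_file(ctx, 1);  /* hypothetical helper */
        return 0;
}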

This design leaves room for improvement, such as creating a cache of
eBPF context (input data, including maps if any) together with the
result of an eBPF program. This would help limit the number of calls to
an eBPF program, the same way SELinux and other kernel components limit
costly checks.

The eBPF maps of progs are useful for calling the same type of eBPF
program. They do not fit this use case because we may want multiple
eBPF programs depending on the action requested on a kernel object
(e.g. FS_GET). The other reason is that an eBPF program does not know
what the next (type of) access check performed by the kernel will be.

To put it another way, this chaining mechanism is a way to split the
evaluation of a kernel object across multiple specialized programs,
each of them able to deal with the data tied to its type. Using a
monolithic eBPF program to check everything does not scale and does not
fit unprivileged use either.

As a side note, the cookie is only an ephemeral value used to keep
state across multiple program calls. It can be used to create a state
machine for an object evaluation.

I don't see a way to do an efficient and programmatic path evaluation,
with different access checks, using the current eBPF features. Please
let me know if you know another way to do it.
Mickaël Salaün April 8, 2018, 1:13 p.m. UTC | #11
On 02/27/2018 10:48 PM, Mickaël Salaün wrote:
> [ Snip ]

Andy, Alexei, Daniel, what do you think about this Landlock program
chaining and cookie?
Andy Lutomirski April 8, 2018, 9:06 p.m. UTC | #12
On Sun, Apr 8, 2018 at 6:13 AM, Mickaël Salaün <mic@digikod.net> wrote:
>
> On 02/27/2018 10:48 PM, Mickaël Salaün wrote:
>> [ Snip ]
>
> Andy, Alexei, Daniel, what do you think about this Landlock program
> chaining and cookie?
>

Can you give a small pseudocode real-world example that actually needs
chaining?  The mechanism is quite complicated and I'd like to
understand how it'll be used.
Mickaël Salaün April 8, 2018, 10:01 p.m. UTC | #13
On 04/08/2018 11:06 PM, Andy Lutomirski wrote:
> On Sun, Apr 8, 2018 at 6:13 AM, Mickaël Salaün <mic@digikod.net> wrote:
>>
>> On 02/27/2018 10:48 PM, Mickaël Salaün wrote:
>>>
>>> On 27/02/2018 17:39, Andy Lutomirski wrote:
>>>> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>> On Tue, Feb 27, 2018 at 05:20:55AM +0000, Andy Lutomirski wrote:
>>>>>> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>> On Tue, Feb 27, 2018 at 04:40:34AM +0000, Andy Lutomirski wrote:
>>>>>>>> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>>>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>> On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>>>>>>>>>> The seccomp(2) syscall can be used by a task to apply a Landlock program
>>>>>>>>>> to itself. As a seccomp filter, a Landlock program is enforced for the
>>>>>>>>>> current task and all its future children. A program is immutable and a
>>>>>>>>>> task can only add new restricting programs to itself, forming a list of
>>>>>>>>>> programss.
>>>>>>>>>>
>>>>>>>>>> A Landlock program is tied to a Landlock hook. If the action on a kernel
>>>>>>>>>> object is allowed by the other Linux security mechanisms (e.g. DAC,
>>>>>>>>>> capabilities, other LSM), then a Landlock hook related to this kind of
>>>>>>>>>> object is triggered. The list of programs for this hook is then
>>>>>>>>>> evaluated. Each program return a 32-bit value which can deny the action
>>>>>>>>>> on a kernel object with a non-zero value. If every programs of the list
>>>>>>>>>> return zero, then the action on the object is allowed.
>>>>>>>>>>
>>>>>>>>>> Multiple Landlock programs can be chained to share a 64-bits value for a
>>>>>>>>>> call chain (e.g. evaluating multiple elements of a file path).  This
>>>>>>>>>> chaining is restricted when a process construct this chain by loading a
>>>>>>>>>> program, but additional checks are performed when it requests to apply
>>>>>>>>>> this chain of programs to itself.  The restrictions ensure that it is
>>>>>>>>>> not possible to call multiple programs in a way that would imply to
>>>>>>>>>> handle multiple shared values (i.e. cookies) for one chain.  For now,
>>>>>>>>>> only a fs_pick program can be chained to the same type of program,
>>>>>>>>>> because it may make sense if they have different triggers (cf. next
>>>>>>>>>> commits).  This restrictions still allows to reuse Landlock programs in
>>>>>>>>>> a safe way (e.g. use the same loaded fs_walk program with multiple
>>>>>>>>>> chains of fs_pick programs).
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>>>>>>>>>
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>>> +struct landlock_prog_set *landlock_prepend_prog(
>>>>>>>>>> +             struct landlock_prog_set *current_prog_set,
>>>>>>>>>> +             struct bpf_prog *prog)
>>>>>>>>>> +{
>>>>>>>>>> +     struct landlock_prog_set *new_prog_set = current_prog_set;
>>>>>>>>>> +     unsigned long pages;
>>>>>>>>>> +     int err;
>>>>>>>>>> +     size_t i;
>>>>>>>>>> +     struct landlock_prog_set tmp_prog_set = {};
>>>>>>>>>> +
>>>>>>>>>> +     if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>>>>>>>>>> +             return ERR_PTR(-EINVAL);
>>>>>>>>>> +
>>>>>>>>>> +     /* validate memory size allocation */
>>>>>>>>>> +     pages = prog->pages;
>>>>>>>>>> +     if (current_prog_set) {
>>>>>>>>>> +             size_t i;
>>>>>>>>>> +
>>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
>>>>>>>>>> +                     struct landlock_prog_list *walker_p;
>>>>>>>>>> +
>>>>>>>>>> +                     for (walker_p = current_prog_set->programs[i];
>>>>>>>>>> +                                     walker_p; walker_p = walker_p->prev)
>>>>>>>>>> +                             pages += walker_p->prog->pages;
>>>>>>>>>> +             }
>>>>>>>>>> +             /* count a struct landlock_prog_set if we need to allocate one */
>>>>>>>>>> +             if (refcount_read(&current_prog_set->usage) != 1)
>>>>>>>>>> +                     pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
>>>>>>>>>> +                             / PAGE_SIZE;
>>>>>>>>>> +     }
>>>>>>>>>> +     if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
>>>>>>>>>> +             return ERR_PTR(-E2BIG);
>>>>>>>>>> +
>>>>>>>>>> +     /* ensure early that we can allocate enough memory for the new
>>>>>>>>>> +      * prog_lists */
>>>>>>>>>> +     err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
>>>>>>>>>> +     if (err)
>>>>>>>>>> +             return ERR_PTR(err);
>>>>>>>>>> +
>>>>>>>>>> +     /*
>>>>>>>>>> +      * Each task_struct points to an array of prog list pointers.  These
>>>>>>>>>> +      * tables are duplicated when additions are made (which means each
>>>>>>>>>> +      * table needs to be refcounted for the processes using it). When a new
>>>>>>>>>> +      * table is created, all the refcounters on the prog_list are bumped (to
>>>>>>>>>> +      * track each table that references the prog). When a new prog is
>>>>>>>>>> +      * added, it's just prepended to the list for the new table to point
>>>>>>>>>> +      * at.
>>>>>>>>>> +      *
>>>>>>>>>> +      * Manage all the possible errors before this step to not uselessly
>>>>>>>>>> +      * duplicate current_prog_set and avoid a rollback.
>>>>>>>>>> +      */
>>>>>>>>>> +     if (!new_prog_set) {
>>>>>>>>>> +             /*
>>>>>>>>>> +              * If there is no Landlock program set used by the current task,
>>>>>>>>>> +              * then create a new one.
>>>>>>>>>> +              */
>>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>>>>> +                     goto put_tmp_lists;
>>>>>>>>>> +     } else if (refcount_read(&current_prog_set->usage) > 1) {
>>>>>>>>>> +             /*
>>>>>>>>>> +              * If the current task is not the sole user of its Landlock
>>>>>>>>>> +              * program set, then duplicate them.
>>>>>>>>>> +              */
>>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>>>>> +                     goto put_tmp_lists;
>>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
>>>>>>>>>> +                     new_prog_set->programs[i] =
>>>>>>>>>> +                             READ_ONCE(current_prog_set->programs[i]);
>>>>>>>>>> +                     if (new_prog_set->programs[i])
>>>>>>>>>> +                             refcount_inc(&new_prog_set->programs[i]->usage);
>>>>>>>>>> +             }
>>>>>>>>>> +
>>>>>>>>>> +             /*
>>>>>>>>>> +              * Landlock program set from the current task will not be freed
>>>>>>>>>> +              * here because the usage is strictly greater than 1. It is
>>>>>>>>>> +              * only prevented to be freed by another task thanks to the
>>>>>>>>>> +              * caller of landlock_prepend_prog() which should be locked if
>>>>>>>>>> +              * needed.
>>>>>>>>>> +              */
>>>>>>>>>> +             landlock_put_prog_set(current_prog_set);
>>>>>>>>>> +     }
>>>>>>>>>> +
>>>>>>>>>> +     /* prepend tmp_prog_set to new_prog_set */
>>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
>>>>>>>>>> +             /* get the last new list */
>>>>>>>>>> +             struct landlock_prog_list *last_list =
>>>>>>>>>> +                     tmp_prog_set.programs[i];
>>>>>>>>>> +
>>>>>>>>>> +             if (last_list) {
>>>>>>>>>> +                     while (last_list->prev)
>>>>>>>>>> +                             last_list = last_list->prev;
>>>>>>>>>> +                     /* no need to increment usage (pointer replacement) */
>>>>>>>>>> +                     last_list->prev = new_prog_set->programs[i];
>>>>>>>>>> +                     new_prog_set->programs[i] = tmp_prog_set.programs[i];
>>>>>>>>>> +             }
>>>>>>>>>> +     }
>>>>>>>>>> +     new_prog_set->chain_last = tmp_prog_set.chain_last;
>>>>>>>>>> +     return new_prog_set;
>>>>>>>>>> +
>>>>>>>>>> +put_tmp_lists:
>>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
>>>>>>>>>> +             put_landlock_prog_list(tmp_prog_set.programs[i]);
>>>>>>>>>> +     return new_prog_set;
>>>>>>>>>> +}
>>>>>>>>>
>>>>>>>>> Nack on the chaining concept.
>>>>>>>>> Please do not reinvent the wheel.
>>>>>>>>> There is an existing mechanism for attaching/detaching/quering multiple
>>>>>>>>> programs attached to cgroup and tracing hooks that are also
>>>>>>>>> efficiently executed via BPF_PROG_RUN_ARRAY.
>>>>>>>>> Please use that instead.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't see how that would help.  Suppose you add a filter, then
>>>>>>>> fork(), and then the child adds another filter.  Do you want to
>>>>>>>> duplicate the entire array?  You certainly can't *modify* the array
>>>>>>>> because you'll affect processes that shouldn't be affected.
>>>>>>>>
>>>>>>>> In contrast, doing this through seccomp like the earlier patches
>>>>>>>> seemed just fine to me, and seccomp already had the right logic.
>>>>>>>
>>>>>>> it doesn't look to me that existing seccomp side of managing fork
>>>>>>> situation can be reused. Here there is an attempt to add 'chaining'
>>>>>>> concept which sort of an extension of existing seccomp style,
>>>>>>> but somehow heavily done on bpf side and contradicts cgroup/tracing.
>>>>>>>
>>>>>>
>>>>>> I don't see why the seccomp way can't be used.  I agree with you that
>>>>>> the seccomp *style* shouldn't be used in bpf code like this, but I
>>>>>> think that Landlock programs can and should just live in the existing
>>>>>> seccomp chain.  If the existing seccomp code needs some modification
>>>>>> to make this work, then so be it.
>>>>>
>>>>> +1
>>>>> if that was the case...
>>>>> but that's not my reading of the patch set.
>>>>
>>>> An earlier version of the patch set used the seccomp filter chain.
>>>> Mickaël, what exactly was wrong with that approach other than that the
>>>> seccomp() syscall was awkward for you to use?  You could add a
>>>> seccomp_add_landlock_rule() syscall if you needed to.
>>>
>>> Nothing was wrong about about that, this part did not changed (see my
>>> next comment).
>>>
>>>>
>>>> As a side comment, why is this an LSM at all, let alone a non-stacking
>>>> LSM?  It would make a lot more sense to me to make Landlock depend on
>>>> having LSMs configured in but to call the landlock hooks directly from
>>>> the security_xyz() hooks.
>>>
>>> See Casey's answer and his patch series: https://lwn.net/Articles/741963/
>>>
>>>>
>>>>>
>>>>>> In other words, the kernel already has two kinds of chaining:
>>>>>> seccomp's and bpf's.  bpf's doesn't work right for this type of usage
>>>>>> across fork(), whereas seccomp's already handles that case correctly.
>>>>>> (In contrast, seccomp's is totally wrong for cgroup-attached filters.)
>>>>>>  So IMO Landlock should use the seccomp core code and call into bpf
>>>>>> for the actual filtering.
>>>>>
>>>>> +1
>>>>> in cgroup we had to invent this new BPF_PROG_RUN_ARRAY mechanism,
>>>>> since cgroup hierarchy can be complicated with bpf progs attached
>>>>> at different levels with different override/multiprog properties,
>>>>> so walking link list and checking all flags at run-time would have
>>>>> been too slow. That's why we added compute_effective_progs().
>>>>
>>>> If we start adding override flags to Landlock, I think we're doing it
>>>> wrong.   With cgroup bpf programs, the whole mess is set up by the
>>>> administrator.  With seccomp, and with Landlock if done correctly, it
>>>> *won't* be set up by the administrator, so the chance that everyone
>>>> gets all the flags right is about zero.  All attached filters should
>>>> run unconditionally.
>>>
>>>
>>> There is a misunderstanding about this chaining mechanism. This should
>>> not be confused with the list of seccomp filters nor the cgroup
>>> hierarchies. Landlock programs can be stacked the same way seccomp's
>>> filters can (cf. struct landlock_prog_set, the "chain_last" field is an
>>> optimization which is not used for this struct handling). This stackable
>>> property did not changed from the previous patch series. The chaining
>>> mechanism is for another use case, which does not make sense for seccomp
>>> filters nor other eBPF program types, at least for now, from what I can
>>> tell.
>>>
>>> You may want to get a look at my talk at FOSDEM
>>> (https://landlock.io/talks/2018-02-04_landlock-fosdem.pdf), especially
>>> slides 11 and 12.
>>>
>>> Let me explain my reasoning about this program chaining thing.
>>>
>>> To check if an action on a file is allowed, we first need to identify
>>> this file and match it to the security policy. In a previous
>>> (non-public) patch series, I tried to use one type of eBPF program to
>>> check every kind of access to a file. To be able to identify a file, I
>>> relied on an eBPF map, similar to the current inode map. This map store
>>> a set of references to file descriptors. I then created a function
>>> bpf_is_file_beneath() to check if the requested file was beneath a file
>>> in the map. This way, no chaining, only one eBPF program type to check
>>> an access to a file... but some issues then emerged. First, this design
>>> create a side-channel which help an attacker using such a program to
>>> infer some information not normally available, for example to get a hint
>>> on where a file descriptor (received from a UNIX socket) come from.
>>> Another issue is that this type of program would be called for each
>>> component of a path. Indeed, when the kernel check if an access to a
>>> file is allowed, it walk through all of the directories in its path
>>> (checking if the current process is allowed to execute them). That first
>>> attempt led me to rethink the way we could filter an access to a file
>>> *path*.
>>>
>>> To minimize the number of called to an eBPF program dedicated to
>>> validate an access to a file path, I decided to create three subtype of
>>> eBPF programs. The FS_WALK type is called when walking through every
>>> directory of a file path (except the last one if it is the target). We
>>> can then restrict this type of program to the minimum set of functions
>>> it is allowed to call and the minimum set of data available from its
>>> context. The first implicit chaining is for this type of program. To be
>>> able to evaluate a path while being called for all its components, this
>>> program need to store a state (to remember what was the parent directory
>>> of this path). There is no "previous" field in the subtype for this
>>> program because it is chained with itself, for each directories. This
>>> enable to create a FS_WALK program to evaluate a file hierarchy, thank
>>> to the inode map which can be used to check if a directory of this
>>> hierarchy is part of an allowed (or denied) list of directories. This
>>> design enables to express a file hierarchy in a programmatic way,
>>> without requiring an eBPF helper to do the job (unlike my first experiment).
>>>
>>> The explicit chaining is used to tied a path evaluation (with a FS_WALK
>>> program) to an access to the actual file being requested (the last
>>> component of a file path), with a FS_PICK program. It is only at this
>>> time that the kernel check for the requested action (e.g. read, write,
>>> chdir, append...). To be able to filter such access request we can have
>>> one call to the same program for every action and let this program check
>>> for which action it was called. However, this design does not allow the
>>> kernel to know if the current action is indeed handled by this program.
>>> Hence, it is not possible to implement a cache mechanism to only call
>>> this program if it knows how to handle this action.
>>>
>>> The approach I took for this FS_PICK type of program is to add to its
>>> subtype which action it can handle (with the "triggers" bitfield, seen
>>> as ORed actions). This way, the kernel knows if a call to a FS_PICK
>>> program is necessary. If the user wants to enforce a different security
>>> policy according to the action requested on a file, then it needs
>>> multiple FS_PICK programs. However, to reduce the number of such
>>> programs, this patch series allow a FS_PICK program to be chained with
>>> another, the same way a FS_WALK is chained with itself. This way, if the
>>> user want to check if the action is a for example an "open" and a "read"
>>> and not a "map" and a "read", then it can chain multiple FS_PICK
>>> programs with different triggers actions. The OR check performed by the
>>> kernel is not a limitation then, only a way to know if a call to an eBPF
>>> program is needed.
>>>
>>> The last type of program is FS_GET. This one is called when a process
>>> get a struct file or change its working directory. This is the only
>>> program type able (and allowed) to tag a file. This restriction is
>>> important to not being subject to resource exhaustion attacks (i.e.
>>> tagging every inode accessible to an attacker, which would allocate too
>>> much kernel memory).
>>>
>>> This design gives room for improvements to create a cache of eBPF
>>> context (input data, including maps if any), with the result of an eBPF
>>> program. This would help limit the number of call to an eBPF program the
>>> same way SELinux or other kernel components do to limit costly checks.
>>>
>>> The eBPF maps of progs are useful to call the same type of eBPF
>>> program. It does not fit with this use case because we may want multiple
>>> eBPF program according to the action requested on a kernel object (e.g.
>>> FS_GET). The other reason is because the eBPF program does not know what
>>> will be the next (type of) access check performed by the kernel.
>>>
>>> To say it another way, this chaining mechanism is a way to split a
>>> kernel object evaluation with multiple specialized programs, each of
>>> them being able to deal with data tied to their type. Using a monolithic
>>> eBPF program to check everything does not scale and does not fit with
>>> unprivileged use either.
>>>
>>> As a side note, the cookie value is only an ephemeral value to keep a
>>> state between multiple programs call. It can be used to create a state
>>> machine for an object evaluation.
>>>
>>> I don't see a way to do an efficient and programmatic path evaluation,
>>> with different access checks, with the current eBPF features. Please let
>>> me know if you know how to do it another way.
>>>
>>
>> Andy, Alexei, Daniel, what do you think about this Landlock program
>> chaining and cookie?
>>
> 
> Can you give a small pseudocode real world example that acutally needs
> chaining?  The mechanism is quite complicated and I'd like to
> understand how it'll be used.
> 

Here is the interesting part from the example (patch 09/11):

+SEC("maps")
+struct bpf_map_def inode_map = {
+	.type = BPF_MAP_TYPE_INODE,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u64),
+	.max_entries = 20,
+};
+
+SEC("subtype/landlock1")
+static union bpf_prog_subtype _subtype1 = {
+	.landlock_hook = {
+		.type = LANDLOCK_HOOK_FS_WALK,
+	}
+};
+
+static __always_inline __u64 update_cookie(__u64 cookie, __u8 lookup,
+		void *inode, void *chain, bool freeze)
+{
+	__u64 map_allow = 0;
+
+	if (cookie == 0) {
+		cookie = bpf_inode_get_tag(inode, chain);
+		if (cookie)
+			return cookie;
+		/* only look for the first match in the map, ignore nested
+		 * paths in this example */
+		map_allow = bpf_inode_map_lookup(&inode_map, inode);
+		if (map_allow)
+			cookie = 1 | map_allow;
+	} else {
+		if (cookie & COOKIE_VALUE_FREEZED)
+			return cookie;
+		map_allow = cookie & _MAP_MARK_MASK;
+		cookie &= ~_MAP_MARK_MASK;
+		switch (lookup) {
+		case LANDLOCK_CTX_FS_WALK_INODE_LOOKUP_DOTDOT:
+			cookie--;
+			break;
+		case LANDLOCK_CTX_FS_WALK_INODE_LOOKUP_DOT:
+			break;
+		default:
+			/* ignore _MAP_MARK_MASK overflow in this example */
+			cookie++;
+			break;
+		}
+		if (cookie >= 1)
+			cookie |= map_allow;
+	}
+	/* do not modify the cookie for each fs_pick */
+	if (freeze && cookie)
+		cookie |= COOKIE_VALUE_FREEZED;
+	return cookie;
+}
+
+SEC("landlock1")
+int fs_walk(struct landlock_ctx_fs_walk *ctx)
+{
+	ctx->cookie = update_cookie(ctx->cookie, ctx->inode_lookup,
+			(void *)ctx->inode, (void *)ctx->chain, false);
+	return LANDLOCK_RET_ALLOW;
+}

The program "landlock1" is called for every directory execution (except
the last one if it is the leaf of a path). This enables to identify a
file hierarchy with only a (one dimension) list of file descriptors
(i.e. inode_map).
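
As a side note for anyone reading only this excerpt: COOKIE_VALUE_FREEZED
and _MAP_MARK_MASK are defined elsewhere in the sample of patch 09/11. One
way such a layout could look (simplified, and not necessarily the exact
definitions from that patch) is:

/* illustrative cookie layout (assumed values, not the patch's exact ones):
 * - the low bits count the directory depth below a matched entry of
 *   inode_map (incremented on each lookup, decremented on "..")
 * - a few high bits carry the access rights found in inode_map
 * - the top bit freezes the value once the path walk is done
 */
#define COOKIE_VALUE_FREEZED	(1ULL << 63)
#define _MAP_MARK_READ		(1ULL << 62)
#define _MAP_MARK_WRITE		(1ULL << 61)
#define _MAP_MARK_MASK		(_MAP_MARK_READ | _MAP_MARK_WRITE)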

Underneath, the Landlock LSM part checks whether each inode access
request is associated with a path walk (nameidata). If there is one,
then the cookie associated with that path walk (if any) is made
available through the eBPF program context. This makes it possible to
implement a state machine in an eBPF program to "evaluate" a file path
(without any string parsing).
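
To make the explicit chaining a bit more concrete, here is a rough sketch
of a fs_pick program that would consume the frozen cookie at the leaf of
the path. The trigger constants, the fs_pick context layout and the
LANDLOCK_RET_DENY value are assumptions on my side (written by analogy
with the fs_walk excerpt above), and the way this program is tied to the
fs_walk chain is left out because it belongs to the other patches of the
series:

SEC("subtype/landlock2")
static union bpf_prog_subtype _subtype2 = {
	.landlock_hook = {
		.type = LANDLOCK_HOOK_FS_PICK,
		/* hypothetical trigger names: only call this program for
		 * open and read accesses */
		.triggers = LANDLOCK_TRIGGER_FS_PICK_OPEN |
			    LANDLOCK_TRIGGER_FS_PICK_READ,
	}
};

SEC("landlock2")
int fs_pick_read(struct landlock_ctx_fs_pick *ctx)
{
	__u64 cookie = ctx->cookie;

	/* deny if the walk did not go through a directory allowed for
	 * read access in inode_map (cf. the illustrative _MAP_MARK_READ
	 * bit above) */
	if (!(cookie & _MAP_MARK_READ))
		return LANDLOCK_RET_DENY;
	return LANDLOCK_RET_ALLOW;
}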

The goal of this chaining mechanism is to be able to express a complex
kernel object like a file, through multiple runs of one or more eBPF
programs, as a multilayer evaluation. These semantics may only make
sense to the user/developer and their security policy. We must keep in
mind that this object identification should be available to unprivileged
processes. This means that we must be very careful about what kind of
information is available to an eBPF program, because it can then leak
to a process (e.g. through a map). With this mechanism, only information
already available to user space is exposed to the eBPF program.

In this example, the complexity of the path evaluation is in the eBPF
program. We can then keep the kernel code simpler and more generic,
which allows more flexibility in defining a security policy.
Alexei Starovoitov April 10, 2018, 4:48 a.m. UTC | #14
On Mon, Apr 09, 2018 at 12:01:59AM +0200, Mickaël Salaün wrote:
> 
> On 04/08/2018 11:06 PM, Andy Lutomirski wrote:
> > On Sun, Apr 8, 2018 at 6:13 AM, Mickaël Salaün <mic@digikod.net> wrote:
> >>
> >> On 02/27/2018 10:48 PM, Mickaël Salaün wrote:
> >>>
> >>> On 27/02/2018 17:39, Andy Lutomirski wrote:
> >>>> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
> >>>> <alexei.starovoitov@gmail.com> wrote:
> >>>>> On Tue, Feb 27, 2018 at 05:20:55AM +0000, Andy Lutomirski wrote:
> >>>>>> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
> >>>>>> <alexei.starovoitov@gmail.com> wrote:
> >>>>>>> On Tue, Feb 27, 2018 at 04:40:34AM +0000, Andy Lutomirski wrote:
> >>>>>>>> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
> >>>>>>>> <alexei.starovoitov@gmail.com> wrote:
> >>>>>>>>> On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
> >>>>>>>>>> The seccomp(2) syscall can be used by a task to apply a Landlock program
> >>>>>>>>>> to itself. As a seccomp filter, a Landlock program is enforced for the
> >>>>>>>>>> current task and all its future children. A program is immutable and a
> >>>>>>>>>> task can only add new restricting programs to itself, forming a list of
> >>>>>>>>>> programss.
> >>>>>>>>>>
> >>>>>>>>>> A Landlock program is tied to a Landlock hook. If the action on a kernel
> >>>>>>>>>> object is allowed by the other Linux security mechanisms (e.g. DAC,
> >>>>>>>>>> capabilities, other LSM), then a Landlock hook related to this kind of
> >>>>>>>>>> object is triggered. The list of programs for this hook is then
> >>>>>>>>>> evaluated. Each program return a 32-bit value which can deny the action
> >>>>>>>>>> on a kernel object with a non-zero value. If every programs of the list
> >>>>>>>>>> return zero, then the action on the object is allowed.
> >>>>>>>>>>
> >>>>>>>>>> Multiple Landlock programs can be chained to share a 64-bits value for a
> >>>>>>>>>> call chain (e.g. evaluating multiple elements of a file path).  This
> >>>>>>>>>> chaining is restricted when a process construct this chain by loading a
> >>>>>>>>>> program, but additional checks are performed when it requests to apply
> >>>>>>>>>> this chain of programs to itself.  The restrictions ensure that it is
> >>>>>>>>>> not possible to call multiple programs in a way that would imply to
> >>>>>>>>>> handle multiple shared values (i.e. cookies) for one chain.  For now,
> >>>>>>>>>> only a fs_pick program can be chained to the same type of program,
> >>>>>>>>>> because it may make sense if they have different triggers (cf. next
> >>>>>>>>>> commits).  This restrictions still allows to reuse Landlock programs in
> >>>>>>>>>> a safe way (e.g. use the same loaded fs_walk program with multiple
> >>>>>>>>>> chains of fs_pick programs).
> >>>>>>>>>>
> >>>>>>>>>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> >>>>>>>>>
> >>>>>>>>> ...
> >>>>>>>>>
> >>>>>>>>>> +struct landlock_prog_set *landlock_prepend_prog(
> >>>>>>>>>> +             struct landlock_prog_set *current_prog_set,
> >>>>>>>>>> +             struct bpf_prog *prog)
> >>>>>>>>>> +{
> >>>>>>>>>> +     struct landlock_prog_set *new_prog_set = current_prog_set;
> >>>>>>>>>> +     unsigned long pages;
> >>>>>>>>>> +     int err;
> >>>>>>>>>> +     size_t i;
> >>>>>>>>>> +     struct landlock_prog_set tmp_prog_set = {};
> >>>>>>>>>> +
> >>>>>>>>>> +     if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
> >>>>>>>>>> +             return ERR_PTR(-EINVAL);
> >>>>>>>>>> +
> >>>>>>>>>> +     /* validate memory size allocation */
> >>>>>>>>>> +     pages = prog->pages;
> >>>>>>>>>> +     if (current_prog_set) {
> >>>>>>>>>> +             size_t i;
> >>>>>>>>>> +
> >>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
> >>>>>>>>>> +                     struct landlock_prog_list *walker_p;
> >>>>>>>>>> +
> >>>>>>>>>> +                     for (walker_p = current_prog_set->programs[i];
> >>>>>>>>>> +                                     walker_p; walker_p = walker_p->prev)
> >>>>>>>>>> +                             pages += walker_p->prog->pages;
> >>>>>>>>>> +             }
> >>>>>>>>>> +             /* count a struct landlock_prog_set if we need to allocate one */
> >>>>>>>>>> +             if (refcount_read(&current_prog_set->usage) != 1)
> >>>>>>>>>> +                     pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
> >>>>>>>>>> +                             / PAGE_SIZE;
> >>>>>>>>>> +     }
> >>>>>>>>>> +     if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
> >>>>>>>>>> +             return ERR_PTR(-E2BIG);
> >>>>>>>>>> +
> >>>>>>>>>> +     /* ensure early that we can allocate enough memory for the new
> >>>>>>>>>> +      * prog_lists */
> >>>>>>>>>> +     err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
> >>>>>>>>>> +     if (err)
> >>>>>>>>>> +             return ERR_PTR(err);
> >>>>>>>>>> +
> >>>>>>>>>> +     /*
> >>>>>>>>>> +      * Each task_struct points to an array of prog list pointers.  These
> >>>>>>>>>> +      * tables are duplicated when additions are made (which means each
> >>>>>>>>>> +      * table needs to be refcounted for the processes using it). When a new
> >>>>>>>>>> +      * table is created, all the refcounters on the prog_list are bumped (to
> >>>>>>>>>> +      * track each table that references the prog). When a new prog is
> >>>>>>>>>> +      * added, it's just prepended to the list for the new table to point
> >>>>>>>>>> +      * at.
> >>>>>>>>>> +      *
> >>>>>>>>>> +      * Manage all the possible errors before this step to not uselessly
> >>>>>>>>>> +      * duplicate current_prog_set and avoid a rollback.
> >>>>>>>>>> +      */
> >>>>>>>>>> +     if (!new_prog_set) {
> >>>>>>>>>> +             /*
> >>>>>>>>>> +              * If there is no Landlock program set used by the current task,
> >>>>>>>>>> +              * then create a new one.
> >>>>>>>>>> +              */
> >>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
> >>>>>>>>>> +             if (IS_ERR(new_prog_set))
> >>>>>>>>>> +                     goto put_tmp_lists;
> >>>>>>>>>> +     } else if (refcount_read(&current_prog_set->usage) > 1) {
> >>>>>>>>>> +             /*
> >>>>>>>>>> +              * If the current task is not the sole user of its Landlock
> >>>>>>>>>> +              * program set, then duplicate them.
> >>>>>>>>>> +              */
> >>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
> >>>>>>>>>> +             if (IS_ERR(new_prog_set))
> >>>>>>>>>> +                     goto put_tmp_lists;
> >>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
> >>>>>>>>>> +                     new_prog_set->programs[i] =
> >>>>>>>>>> +                             READ_ONCE(current_prog_set->programs[i]);
> >>>>>>>>>> +                     if (new_prog_set->programs[i])
> >>>>>>>>>> +                             refcount_inc(&new_prog_set->programs[i]->usage);
> >>>>>>>>>> +             }
> >>>>>>>>>> +
> >>>>>>>>>> +             /*
> >>>>>>>>>> +              * Landlock program set from the current task will not be freed
> >>>>>>>>>> +              * here because the usage is strictly greater than 1. It is
> >>>>>>>>>> +              * only prevented to be freed by another task thanks to the
> >>>>>>>>>> +              * caller of landlock_prepend_prog() which should be locked if
> >>>>>>>>>> +              * needed.
> >>>>>>>>>> +              */
> >>>>>>>>>> +             landlock_put_prog_set(current_prog_set);
> >>>>>>>>>> +     }
> >>>>>>>>>> +
> >>>>>>>>>> +     /* prepend tmp_prog_set to new_prog_set */
> >>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
> >>>>>>>>>> +             /* get the last new list */
> >>>>>>>>>> +             struct landlock_prog_list *last_list =
> >>>>>>>>>> +                     tmp_prog_set.programs[i];
> >>>>>>>>>> +
> >>>>>>>>>> +             if (last_list) {
> >>>>>>>>>> +                     while (last_list->prev)
> >>>>>>>>>> +                             last_list = last_list->prev;
> >>>>>>>>>> +                     /* no need to increment usage (pointer replacement) */
> >>>>>>>>>> +                     last_list->prev = new_prog_set->programs[i];
> >>>>>>>>>> +                     new_prog_set->programs[i] = tmp_prog_set.programs[i];
> >>>>>>>>>> +             }
> >>>>>>>>>> +     }
> >>>>>>>>>> +     new_prog_set->chain_last = tmp_prog_set.chain_last;
> >>>>>>>>>> +     return new_prog_set;
> >>>>>>>>>> +
> >>>>>>>>>> +put_tmp_lists:
> >>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
> >>>>>>>>>> +             put_landlock_prog_list(tmp_prog_set.programs[i]);
> >>>>>>>>>> +     return new_prog_set;
> >>>>>>>>>> +}
> >>>>>>>>>
> >>>>>>>>> Nack on the chaining concept.
> >>>>>>>>> Please do not reinvent the wheel.
> >>>>>>>>> There is an existing mechanism for attaching/detaching/quering multiple
> >>>>>>>>> programs attached to cgroup and tracing hooks that are also
> >>>>>>>>> efficiently executed via BPF_PROG_RUN_ARRAY.
> >>>>>>>>> Please use that instead.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> I don't see how that would help.  Suppose you add a filter, then
> >>>>>>>> fork(), and then the child adds another filter.  Do you want to
> >>>>>>>> duplicate the entire array?  You certainly can't *modify* the array
> >>>>>>>> because you'll affect processes that shouldn't be affected.
> >>>>>>>>
> >>>>>>>> In contrast, doing this through seccomp like the earlier patches
> >>>>>>>> seemed just fine to me, and seccomp already had the right logic.
> >>>>>>>
> >>>>>>> it doesn't look to me that existing seccomp side of managing fork
> >>>>>>> situation can be reused. Here there is an attempt to add 'chaining'
> >>>>>>> concept which sort of an extension of existing seccomp style,
> >>>>>>> but somehow heavily done on bpf side and contradicts cgroup/tracing.
> >>>>>>>
> >>>>>>
> >>>>>> I don't see why the seccomp way can't be used.  I agree with you that
> >>>>>> the seccomp *style* shouldn't be used in bpf code like this, but I
> >>>>>> think that Landlock programs can and should just live in the existing
> >>>>>> seccomp chain.  If the existing seccomp code needs some modification
> >>>>>> to make this work, then so be it.
> >>>>>
> >>>>> +1
> >>>>> if that was the case...
> >>>>> but that's not my reading of the patch set.
> >>>>
> >>>> An earlier version of the patch set used the seccomp filter chain.
> >>>> Mickaël, what exactly was wrong with that approach other than that the
> >>>> seccomp() syscall was awkward for you to use?  You could add a
> >>>> seccomp_add_landlock_rule() syscall if you needed to.
> >>>
> >>> Nothing was wrong about about that, this part did not changed (see my
> >>> next comment).
> >>>
> >>>>
> >>>> As a side comment, why is this an LSM at all, let alone a non-stacking
> >>>> LSM?  It would make a lot more sense to me to make Landlock depend on
> >>>> having LSMs configured in but to call the landlock hooks directly from
> >>>> the security_xyz() hooks.
> >>>
> >>> See Casey's answer and his patch series: https://lwn.net/Articles/741963/
> >>>
> >>>>
> >>>>>
> >>>>>> In other words, the kernel already has two kinds of chaining:
> >>>>>> seccomp's and bpf's.  bpf's doesn't work right for this type of usage
> >>>>>> across fork(), whereas seccomp's already handles that case correctly.
> >>>>>> (In contrast, seccomp's is totally wrong for cgroup-attached filters.)
> >>>>>>  So IMO Landlock should use the seccomp core code and call into bpf
> >>>>>> for the actual filtering.
> >>>>>
> >>>>> +1
> >>>>> in cgroup we had to invent this new BPF_PROG_RUN_ARRAY mechanism,
> >>>>> since cgroup hierarchy can be complicated with bpf progs attached
> >>>>> at different levels with different override/multiprog properties,
> >>>>> so walking link list and checking all flags at run-time would have
> >>>>> been too slow. That's why we added compute_effective_progs().
> >>>>
> >>>> If we start adding override flags to Landlock, I think we're doing it
> >>>> wrong.   With cgroup bpf programs, the whole mess is set up by the
> >>>> administrator.  With seccomp, and with Landlock if done correctly, it
> >>>> *won't* be set up by the administrator, so the chance that everyone
> >>>> gets all the flags right is about zero.  All attached filters should
> >>>> run unconditionally.
> >>>
> >>>
> >>> There is a misunderstanding about this chaining mechanism. This should
> >>> not be confused with the list of seccomp filters nor the cgroup
> >>> hierarchies. Landlock programs can be stacked the same way seccomp's
> >>> filters can (cf. struct landlock_prog_set, the "chain_last" field is an
> >>> optimization which is not used for this struct handling). This stackable
> >>> property did not changed from the previous patch series. The chaining
> >>> mechanism is for another use case, which does not make sense for seccomp
> >>> filters nor other eBPF program types, at least for now, from what I can
> >>> tell.
> >>>
> >>> You may want to get a look at my talk at FOSDEM
> >>> (https://landlock.io/talks/2018-02-04_landlock-fosdem.pdf), especially
> >>> slides 11 and 12.
> >>>
> >>> Let me explain my reasoning about this program chaining thing.
> >>>
> >>> To check if an action on a file is allowed, we first need to identify
> >>> this file and match it to the security policy. In a previous
> >>> (non-public) patch series, I tried to use one type of eBPF program to
> >>> check every kind of access to a file. To be able to identify a file, I
> >>> relied on an eBPF map, similar to the current inode map. This map store
> >>> a set of references to file descriptors. I then created a function
> >>> bpf_is_file_beneath() to check if the requested file was beneath a file
> >>> in the map. This way, no chaining, only one eBPF program type to check
> >>> an access to a file... but some issues then emerged. First, this design
> >>> create a side-channel which help an attacker using such a program to
> >>> infer some information not normally available, for example to get a hint
> >>> on where a file descriptor (received from a UNIX socket) come from.
> >>> Another issue is that this type of program would be called for each
> >>> component of a path. Indeed, when the kernel check if an access to a
> >>> file is allowed, it walk through all of the directories in its path
> >>> (checking if the current process is allowed to execute them). That first
> >>> attempt led me to rethink the way we could filter an access to a file
> >>> *path*.
> >>>
> >>> To minimize the number of called to an eBPF program dedicated to
> >>> validate an access to a file path, I decided to create three subtype of
> >>> eBPF programs. The FS_WALK type is called when walking through every
> >>> directory of a file path (except the last one if it is the target). We
> >>> can then restrict this type of program to the minimum set of functions
> >>> it is allowed to call and the minimum set of data available from its
> >>> context. The first implicit chaining is for this type of program. To be
> >>> able to evaluate a path while being called for all its components, this
> >>> program need to store a state (to remember what was the parent directory
> >>> of this path). There is no "previous" field in the subtype for this
> >>> program because it is chained with itself, for each directories. This
> >>> enable to create a FS_WALK program to evaluate a file hierarchy, thank
> >>> to the inode map which can be used to check if a directory of this
> >>> hierarchy is part of an allowed (or denied) list of directories. This
> >>> design enables to express a file hierarchy in a programmatic way,
> >>> without requiring an eBPF helper to do the job (unlike my first experiment).
> >>>
> >>> The explicit chaining is used to tied a path evaluation (with a FS_WALK
> >>> program) to an access to the actual file being requested (the last
> >>> component of a file path), with a FS_PICK program. It is only at this
> >>> time that the kernel check for the requested action (e.g. read, write,
> >>> chdir, append...). To be able to filter such access request we can have
> >>> one call to the same program for every action and let this program check
> >>> for which action it was called. However, this design does not allow the
> >>> kernel to know if the current action is indeed handled by this program.
> >>> Hence, it is not possible to implement a cache mechanism to only call
> >>> this program if it knows how to handle this action.
> >>>
> >>> The approach I took for this FS_PICK type of program is to add to its
> >>> subtype which action it can handle (with the "triggers" bitfield, seen
> >>> as ORed actions). This way, the kernel knows if a call to a FS_PICK
> >>> program is necessary. If the user wants to enforce a different security
> >>> policy according to the action requested on a file, then it needs
> >>> multiple FS_PICK programs. However, to reduce the number of such
> >>> programs, this patch series allow a FS_PICK program to be chained with
> >>> another, the same way a FS_WALK is chained with itself. This way, if the
> >>> user want to check if the action is a for example an "open" and a "read"
> >>> and not a "map" and a "read", then it can chain multiple FS_PICK
> >>> programs with different triggers actions. The OR check performed by the
> >>> kernel is not a limitation then, only a way to know if a call to an eBPF
> >>> program is needed.
> >>>
> >>> The last type of program is FS_GET. This one is called when a process
> >>> get a struct file or change its working directory. This is the only
> >>> program type able (and allowed) to tag a file. This restriction is
> >>> important to not being subject to resource exhaustion attacks (i.e.
> >>> tagging every inode accessible to an attacker, which would allocate too
> >>> much kernel memory).
> >>>
> >>> This design gives room for improvements to create a cache of eBPF
> >>> context (input data, including maps if any), with the result of an eBPF
> >>> program. This would help limit the number of call to an eBPF program the
> >>> same way SELinux or other kernel components do to limit costly checks.
> >>>
> >>> The eBPF maps of progs are useful to call the same type of eBPF
> >>> program. It does not fit with this use case because we may want multiple
> >>> eBPF program according to the action requested on a kernel object (e.g.
> >>> FS_GET). The other reason is because the eBPF program does not know what
> >>> will be the next (type of) access check performed by the kernel.
> >>>
> >>> To say it another way, this chaining mechanism is a way to split a
> >>> kernel object evaluation with multiple specialized programs, each of
> >>> them being able to deal with data tied to their type. Using a monolithic
> >>> eBPF program to check everything does not scale and does not fit with
> >>> unprivileged use either.
> >>>
> >>> As a side note, the cookie value is only an ephemeral value to keep a
> >>> state between multiple programs call. It can be used to create a state
> >>> machine for an object evaluation.
> >>>
> >>> I don't see a way to do an efficient and programmatic path evaluation,
> >>> with different access checks, with the current eBPF features. Please let
> >>> me know if you know how to do it another way.
> >>>
> >>
> >> Andy, Alexei, Daniel, what do you think about this Landlock program
> >> chaining and cookie?
> >>
> > 
> > Can you give a small pseudocode real world example that acutally needs
> > chaining?  The mechanism is quite complicated and I'd like to
> > understand how it'll be used.
> > 
> 
> Here is the interesting part from the example (patch 09/11):
> 
> +SEC("maps")
> +struct bpf_map_def inode_map = {
> +	.type = BPF_MAP_TYPE_INODE,
> +	.key_size = sizeof(u32),
> +	.value_size = sizeof(u64),
> +	.max_entries = 20,
> +};
> +
> +SEC("subtype/landlock1")
> +static union bpf_prog_subtype _subtype1 = {
> +	.landlock_hook = {
> +		.type = LANDLOCK_HOOK_FS_WALK,
> +	}
> +};
> +
> +static __always_inline __u64 update_cookie(__u64 cookie, __u8 lookup,
> +		void *inode, void *chain, bool freeze)
> +{
> +	__u64 map_allow = 0;
> +
> +	if (cookie == 0) {
> +		cookie = bpf_inode_get_tag(inode, chain);
> +		if (cookie)
> +			return cookie;
> +		/* only look for the first match in the map, ignore nested
> +		 * paths in this example */
> +		map_allow = bpf_inode_map_lookup(&inode_map, inode);
> +		if (map_allow)
> +			cookie = 1 | map_allow;
> +	} else {
> +		if (cookie & COOKIE_VALUE_FREEZED)
> +			return cookie;
> +		map_allow = cookie & _MAP_MARK_MASK;
> +		cookie &= ~_MAP_MARK_MASK;
> +		switch (lookup) {
> +		case LANDLOCK_CTX_FS_WALK_INODE_LOOKUP_DOTDOT:
> +			cookie--;
> +			break;
> +		case LANDLOCK_CTX_FS_WALK_INODE_LOOKUP_DOT:
> +			break;
> +		default:
> +			/* ignore _MAP_MARK_MASK overflow in this example */
> +			cookie++;
> +			break;
> +		}
> +		if (cookie >= 1)
> +			cookie |= map_allow;
> +	}
> +	/* do not modify the cookie for each fs_pick */
> +	if (freeze && cookie)
> +		cookie |= COOKIE_VALUE_FREEZED;
> +	return cookie;
> +}
> +
> +SEC("landlock1")
> +int fs_walk(struct landlock_ctx_fs_walk *ctx)
> +{
> +	ctx->cookie = update_cookie(ctx->cookie, ctx->inode_lookup,
> +			(void *)ctx->inode, (void *)ctx->chain, false);
> +	return LANDLOCK_RET_ALLOW;
> +}
> 
> The program "landlock1" is called for every directory execution (except
> the last one if it is the leaf of a path). This enables to identify a
> file hierarchy with only a (one dimension) list of file descriptors
> (i.e. inode_map).
> 
> Underneath, the Landlock LSM part looks if there is an associated path
> walk (nameidata) with each inode access request. If there is one, then
> the cookie associated with the path walk (if any) is made available
> through the eBPF program context. This enables to develop a state
> machine with an eBPF program to "evaluate" a file path (without string
> parsing).
> 
> The goal with this chaining mechanism is to be able to express a complex
> kernel object like a file, with multiple run of one or more eBPF
> programs, as a multilayer evaluation. This semantic may only make sense
> for the user/developer and his security policy. We must keep in mind
> that this object identification should be available to unprivileged
> processes. This means that we must be very careful to what kind of
> information are available to an eBPF program because this can then leak
> to a process (e.g. through a map). With this mechanism, only information
> already available to user space is available to the eBPF program.
> 
> In this example, the complexity of the path evaluation is in the eBPF
> program. We can then keep the kernel code more simple and generic. This
> enables more flexibility for a security policy definition.

it all sounds correct on paper, but it's a pretty novel approach and
I'm not sure I see all the details in the patch.
When people say "inode" they most of the time mean the inode's integer
number, whereas in this patch do you mean a raw pointer to the
in-kernel 'struct inode'?
To avoid confusion it should probably be called differently.

If you meant the inode as a number, then why the inode only?
Where are the superblock, device, and mount point?
How can the bpf side compare inodes without this additional info?
How will the bpf side know which inode to compare against?
What if an inode number is reused?
Is this approach an optimization to compare inodes
instead of the strings passed into sys_open?

If you meant the inode as a pointer, how will the bpf side
know the pointer before the walk begins?
What guarantees that it's not a stale pointer?
Mickaël Salaün April 11, 2018, 10:18 p.m. UTC | #15
On 04/10/2018 06:48 AM, Alexei Starovoitov wrote:
> On Mon, Apr 09, 2018 at 12:01:59AM +0200, Mickaël Salaün wrote:
>>
>> On 04/08/2018 11:06 PM, Andy Lutomirski wrote:
>>> On Sun, Apr 8, 2018 at 6:13 AM, Mickaël Salaün <mic@digikod.net> wrote:
>>>>
>>>> On 02/27/2018 10:48 PM, Mickaël Salaün wrote:
>>>>>
>>>>> On 27/02/2018 17:39, Andy Lutomirski wrote:
>>>>>> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>> On Tue, Feb 27, 2018 at 05:20:55AM +0000, Andy Lutomirski wrote:
>>>>>>>> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
>>>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>> On Tue, Feb 27, 2018 at 04:40:34AM +0000, Andy Lutomirski wrote:
>>>>>>>>>> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>>>>>>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>>>>>>> On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>>>>>>>>>>>> The seccomp(2) syscall can be used by a task to apply a Landlock program
>>>>>>>>>>>> to itself. As a seccomp filter, a Landlock program is enforced for the
>>>>>>>>>>>> current task and all its future children. A program is immutable and a
>>>>>>>>>>>> task can only add new restricting programs to itself, forming a list of
>>>>>>>>>>>> programss.
>>>>>>>>>>>>
>>>>>>>>>>>> A Landlock program is tied to a Landlock hook. If the action on a kernel
>>>>>>>>>>>> object is allowed by the other Linux security mechanisms (e.g. DAC,
>>>>>>>>>>>> capabilities, other LSM), then a Landlock hook related to this kind of
>>>>>>>>>>>> object is triggered. The list of programs for this hook is then
>>>>>>>>>>>> evaluated. Each program return a 32-bit value which can deny the action
>>>>>>>>>>>> on a kernel object with a non-zero value. If every programs of the list
>>>>>>>>>>>> return zero, then the action on the object is allowed.
>>>>>>>>>>>>
>>>>>>>>>>>> Multiple Landlock programs can be chained to share a 64-bits value for a
>>>>>>>>>>>> call chain (e.g. evaluating multiple elements of a file path).  This
>>>>>>>>>>>> chaining is restricted when a process construct this chain by loading a
>>>>>>>>>>>> program, but additional checks are performed when it requests to apply
>>>>>>>>>>>> this chain of programs to itself.  The restrictions ensure that it is
>>>>>>>>>>>> not possible to call multiple programs in a way that would imply to
>>>>>>>>>>>> handle multiple shared values (i.e. cookies) for one chain.  For now,
>>>>>>>>>>>> only a fs_pick program can be chained to the same type of program,
>>>>>>>>>>>> because it may make sense if they have different triggers (cf. next
>>>>>>>>>>>> commits).  This restrictions still allows to reuse Landlock programs in
>>>>>>>>>>>> a safe way (e.g. use the same loaded fs_walk program with multiple
>>>>>>>>>>>> chains of fs_pick programs).
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>>>>>>>>>>>
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>>> +struct landlock_prog_set *landlock_prepend_prog(
>>>>>>>>>>>> +             struct landlock_prog_set *current_prog_set,
>>>>>>>>>>>> +             struct bpf_prog *prog)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +     struct landlock_prog_set *new_prog_set = current_prog_set;
>>>>>>>>>>>> +     unsigned long pages;
>>>>>>>>>>>> +     int err;
>>>>>>>>>>>> +     size_t i;
>>>>>>>>>>>> +     struct landlock_prog_set tmp_prog_set = {};
>>>>>>>>>>>> +
>>>>>>>>>>>> +     if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>>>>>>>>>>>> +             return ERR_PTR(-EINVAL);
>>>>>>>>>>>> +
>>>>>>>>>>>> +     /* validate memory size allocation */
>>>>>>>>>>>> +     pages = prog->pages;
>>>>>>>>>>>> +     if (current_prog_set) {
>>>>>>>>>>>> +             size_t i;
>>>>>>>>>>>> +
>>>>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
>>>>>>>>>>>> +                     struct landlock_prog_list *walker_p;
>>>>>>>>>>>> +
>>>>>>>>>>>> +                     for (walker_p = current_prog_set->programs[i];
>>>>>>>>>>>> +                                     walker_p; walker_p = walker_p->prev)
>>>>>>>>>>>> +                             pages += walker_p->prog->pages;
>>>>>>>>>>>> +             }
>>>>>>>>>>>> +             /* count a struct landlock_prog_set if we need to allocate one */
>>>>>>>>>>>> +             if (refcount_read(&current_prog_set->usage) != 1)
>>>>>>>>>>>> +                     pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
>>>>>>>>>>>> +                             / PAGE_SIZE;
>>>>>>>>>>>> +     }
>>>>>>>>>>>> +     if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
>>>>>>>>>>>> +             return ERR_PTR(-E2BIG);
>>>>>>>>>>>> +
>>>>>>>>>>>> +     /* ensure early that we can allocate enough memory for the new
>>>>>>>>>>>> +      * prog_lists */
>>>>>>>>>>>> +     err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
>>>>>>>>>>>> +     if (err)
>>>>>>>>>>>> +             return ERR_PTR(err);
>>>>>>>>>>>> +
>>>>>>>>>>>> +     /*
>>>>>>>>>>>> +      * Each task_struct points to an array of prog list pointers.  These
>>>>>>>>>>>> +      * tables are duplicated when additions are made (which means each
>>>>>>>>>>>> +      * table needs to be refcounted for the processes using it). When a new
>>>>>>>>>>>> +      * table is created, all the refcounters on the prog_list are bumped (to
>>>>>>>>>>>> +      * track each table that references the prog). When a new prog is
>>>>>>>>>>>> +      * added, it's just prepended to the list for the new table to point
>>>>>>>>>>>> +      * at.
>>>>>>>>>>>> +      *
>>>>>>>>>>>> +      * Manage all the possible errors before this step to not uselessly
>>>>>>>>>>>> +      * duplicate current_prog_set and avoid a rollback.
>>>>>>>>>>>> +      */
>>>>>>>>>>>> +     if (!new_prog_set) {
>>>>>>>>>>>> +             /*
>>>>>>>>>>>> +              * If there is no Landlock program set used by the current task,
>>>>>>>>>>>> +              * then create a new one.
>>>>>>>>>>>> +              */
>>>>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>>>>>>> +                     goto put_tmp_lists;
>>>>>>>>>>>> +     } else if (refcount_read(&current_prog_set->usage) > 1) {
>>>>>>>>>>>> +             /*
>>>>>>>>>>>> +              * If the current task is not the sole user of its Landlock
>>>>>>>>>>>> +              * program set, then duplicate them.
>>>>>>>>>>>> +              */
>>>>>>>>>>>> +             new_prog_set = new_landlock_prog_set();
>>>>>>>>>>>> +             if (IS_ERR(new_prog_set))
>>>>>>>>>>>> +                     goto put_tmp_lists;
>>>>>>>>>>>> +             for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
>>>>>>>>>>>> +                     new_prog_set->programs[i] =
>>>>>>>>>>>> +                             READ_ONCE(current_prog_set->programs[i]);
>>>>>>>>>>>> +                     if (new_prog_set->programs[i])
>>>>>>>>>>>> +                             refcount_inc(&new_prog_set->programs[i]->usage);
>>>>>>>>>>>> +             }
>>>>>>>>>>>> +
>>>>>>>>>>>> +             /*
>>>>>>>>>>>> +              * Landlock program set from the current task will not be freed
>>>>>>>>>>>> +              * here because the usage is strictly greater than 1. It is
>>>>>>>>>>>> +              * only prevented to be freed by another task thanks to the
>>>>>>>>>>>> +              * caller of landlock_prepend_prog() which should be locked if
>>>>>>>>>>>> +              * needed.
>>>>>>>>>>>> +              */
>>>>>>>>>>>> +             landlock_put_prog_set(current_prog_set);
>>>>>>>>>>>> +     }
>>>>>>>>>>>> +
>>>>>>>>>>>> +     /* prepend tmp_prog_set to new_prog_set */
>>>>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
>>>>>>>>>>>> +             /* get the last new list */
>>>>>>>>>>>> +             struct landlock_prog_list *last_list =
>>>>>>>>>>>> +                     tmp_prog_set.programs[i];
>>>>>>>>>>>> +
>>>>>>>>>>>> +             if (last_list) {
>>>>>>>>>>>> +                     while (last_list->prev)
>>>>>>>>>>>> +                             last_list = last_list->prev;
>>>>>>>>>>>> +                     /* no need to increment usage (pointer replacement) */
>>>>>>>>>>>> +                     last_list->prev = new_prog_set->programs[i];
>>>>>>>>>>>> +                     new_prog_set->programs[i] = tmp_prog_set.programs[i];
>>>>>>>>>>>> +             }
>>>>>>>>>>>> +     }
>>>>>>>>>>>> +     new_prog_set->chain_last = tmp_prog_set.chain_last;
>>>>>>>>>>>> +     return new_prog_set;
>>>>>>>>>>>> +
>>>>>>>>>>>> +put_tmp_lists:
>>>>>>>>>>>> +     for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
>>>>>>>>>>>> +             put_landlock_prog_list(tmp_prog_set.programs[i]);
>>>>>>>>>>>> +     return new_prog_set;
>>>>>>>>>>>> +}
>>>>>>>>>>>
>>>>>>>>>>> Nack on the chaining concept.
>>>>>>>>>>> Please do not reinvent the wheel.
>>>>>>>>>>> There is an existing mechanism for attaching/detaching/quering multiple
>>>>>>>>>>> programs attached to cgroup and tracing hooks that are also
>>>>>>>>>>> efficiently executed via BPF_PROG_RUN_ARRAY.
>>>>>>>>>>> Please use that instead.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I don't see how that would help.  Suppose you add a filter, then
>>>>>>>>>> fork(), and then the child adds another filter.  Do you want to
>>>>>>>>>> duplicate the entire array?  You certainly can't *modify* the array
>>>>>>>>>> because you'll affect processes that shouldn't be affected.
>>>>>>>>>>
>>>>>>>>>> In contrast, doing this through seccomp like the earlier patches
>>>>>>>>>> seemed just fine to me, and seccomp already had the right logic.
>>>>>>>>>
>>>>>>>>> it doesn't look to me that existing seccomp side of managing fork
>>>>>>>>> situation can be reused. Here there is an attempt to add 'chaining'
>>>>>>>>> concept which sort of an extension of existing seccomp style,
>>>>>>>>> but somehow heavily done on bpf side and contradicts cgroup/tracing.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't see why the seccomp way can't be used.  I agree with you that
>>>>>>>> the seccomp *style* shouldn't be used in bpf code like this, but I
>>>>>>>> think that Landlock programs can and should just live in the existing
>>>>>>>> seccomp chain.  If the existing seccomp code needs some modification
>>>>>>>> to make this work, then so be it.
>>>>>>>
>>>>>>> +1
>>>>>>> if that was the case...
>>>>>>> but that's not my reading of the patch set.
>>>>>>
>>>>>> An earlier version of the patch set used the seccomp filter chain.
>>>>>> Mickaël, what exactly was wrong with that approach other than that the
>>>>>> seccomp() syscall was awkward for you to use?  You could add a
>>>>>> seccomp_add_landlock_rule() syscall if you needed to.
>>>>>
>>>>> Nothing was wrong with that; this part did not change (see my
>>>>> next comment).
>>>>>
>>>>>>
>>>>>> As a side comment, why is this an LSM at all, let alone a non-stacking
>>>>>> LSM?  It would make a lot more sense to me to make Landlock depend on
>>>>>> having LSMs configured in but to call the landlock hooks directly from
>>>>>> the security_xyz() hooks.
>>>>>
>>>>> See Casey's answer and his patch series: https://lwn.net/Articles/741963/
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> In other words, the kernel already has two kinds of chaining:
>>>>>>>> seccomp's and bpf's.  bpf's doesn't work right for this type of usage
>>>>>>>> across fork(), whereas seccomp's already handles that case correctly.
>>>>>>>> (In contrast, seccomp's is totally wrong for cgroup-attached filters.)
>>>>>>>>  So IMO Landlock should use the seccomp core code and call into bpf
>>>>>>>> for the actual filtering.
>>>>>>>
>>>>>>> +1
>>>>>>> in cgroup we had to invent this new BPF_PROG_RUN_ARRAY mechanism,
>>>>>>> since cgroup hierarchy can be complicated with bpf progs attached
>>>>>>> at different levels with different override/multiprog properties,
>>>>>>> so walking a linked list and checking all flags at run-time would have
>>>>>>> been too slow. That's why we added compute_effective_progs().
>>>>>>
>>>>>> If we start adding override flags to Landlock, I think we're doing it
>>>>>> wrong.   With cgroup bpf programs, the whole mess is set up by the
>>>>>> administrator.  With seccomp, and with Landlock if done correctly, it
>>>>>> *won't* be set up by the administrator, so the chance that everyone
>>>>>> gets all the flags right is about zero.  All attached filters should
>>>>>> run unconditionally.
>>>>>
>>>>>
>>>>> There is a misunderstanding about this chaining mechanism. This should
>>>>> not be confused with the list of seccomp filters or with cgroup
>>>>> hierarchies. Landlock programs can be stacked the same way seccomp's
>>>>> filters can (cf. struct landlock_prog_set, the "chain_last" field is an
>>>>> optimization which is not used for this struct handling). This stackable
>>>>> property has not changed from the previous patch series. The chaining
>>>>> mechanism is for another use case, which does not make sense for seccomp
>>>>> filters nor other eBPF program types, at least for now, from what I can
>>>>> tell.
>>>>>
>>>>> You may want to get a look at my talk at FOSDEM
>>>>> (https://landlock.io/talks/2018-02-04_landlock-fosdem.pdf), especially
>>>>> slides 11 and 12.
>>>>>
>>>>> Let me explain my reasoning about this program chaining thing.
>>>>>
>>>>> To check if an action on a file is allowed, we first need to identify
>>>>> this file and match it to the security policy. In a previous
>>>>> (non-public) patch series, I tried to use one type of eBPF program to
>>>>> check every kind of access to a file. To be able to identify a file, I
>>>>> relied on an eBPF map, similar to the current inode map. This map stored
>>>>> a set of references to file descriptors. I then created a function
>>>>> bpf_is_file_beneath() to check if the requested file was beneath a file
>>>>> in the map. This way, no chaining, only one eBPF program type to check
>>>>> an access to a file... but some issues then emerged. First, this design
>>>>> created a side channel which could help an attacker using such a
>>>>> program to infer information not normally available, for example to get
>>>>> a hint on where a file descriptor (received from a UNIX socket) came
>>>>> from. Another issue is that this type of program would be called for
>>>>> each component of a path. Indeed, when the kernel checks if an access
>>>>> to a file is allowed, it walks through all of the directories in its
>>>>> path (checking if the current process is allowed to execute them). That
>>>>> first attempt led me to rethink the way we could filter an access to a
>>>>> file *path*.
>>>>>
>>>>> To minimize the number of calls to an eBPF program dedicated to
>>>>> validating an access to a file path, I decided to create three subtypes
>>>>> of eBPF programs. The FS_WALK type is called when walking through every
>>>>> directory of a file path (except the last one if it is the target). We
>>>>> can then restrict this type of program to the minimum set of functions
>>>>> it is allowed to call and the minimum set of data available from its
>>>>> context. The first implicit chaining is for this type of program. To be
>>>>> able to evaluate a path while being called for all its components, this
>>>>> program needs to store a state (to remember what the parent directory
>>>>> of this path was). There is no "previous" field in the subtype for this
>>>>> program because it is chained with itself, for each directory. This
>>>>> makes it possible to create a FS_WALK program to evaluate a file
>>>>> hierarchy, thanks to the inode map which can be used to check if a
>>>>> directory of this hierarchy is part of an allowed (or denied) list of
>>>>> directories. This design makes it possible to express a file hierarchy
>>>>> in a programmatic way, without requiring an eBPF helper to do the job
>>>>> (unlike my first experiment).
>>>>>
>>>>> The explicit chaining is used to tie a path evaluation (with a FS_WALK
>>>>> program) to an access to the actual file being requested (the last
>>>>> component of a file path), with a FS_PICK program. It is only at this
>>>>> time that the kernel checks for the requested action (e.g. read, write,
>>>>> chdir, append...). To be able to filter such access requests, we could
>>>>> have one call to the same program for every action and let this program
>>>>> check which action it was called for. However, this design does not
>>>>> allow the kernel to know if the current action is indeed handled by
>>>>> this program. Hence, it is not possible to implement a cache mechanism
>>>>> to only call this program if it knows how to handle this action.
>>>>>
>>>>> The approach I took for this FS_PICK type of program is to add to its
>>>>> subtype which actions it can handle (with the "triggers" bitfield, seen
>>>>> as ORed actions). This way, the kernel knows if a call to a FS_PICK
>>>>> program is necessary. If the user wants to enforce a different security
>>>>> policy according to the action requested on a file, then they need
>>>>> multiple FS_PICK programs. However, to reduce the number of such
>>>>> programs, this patch series allows a FS_PICK program to be chained with
>>>>> another, the same way a FS_WALK is chained with itself. This way, if
>>>>> the user wants to check that the action is, for example, an "open" and
>>>>> a "read" and not a "map" and a "read", then they can chain multiple
>>>>> FS_PICK programs with different trigger actions. The OR check performed
>>>>> by the kernel is not a limitation then, only a way to know if a call to
>>>>> an eBPF program is needed.
>>>>>
>>>>> The last type of program is FS_GET. This one is called when a process
>>>>> gets a struct file or changes its working directory. This is the only
>>>>> program type able (and allowed) to tag a file. This restriction is
>>>>> important to avoid being subject to resource exhaustion attacks (i.e.
>>>>> tagging every inode accessible to an attacker, which would allocate too
>>>>> much kernel memory).
>>>>>
>>>>> This design leaves room for improvements, e.g. creating a cache of
>>>>> eBPF context (input data, including maps if any) together with the
>>>>> result of an eBPF program. This would help limit the number of calls to
>>>>> an eBPF program, the same way SELinux or other kernel components limit
>>>>> costly checks.
>>>>>
>>>>> eBPF maps of programs are useful for calling the same type of eBPF
>>>>> program. They do not fit this use case because we may want multiple
>>>>> eBPF programs depending on the action requested on a kernel object
>>>>> (e.g. FS_GET). The other reason is that the eBPF program does not know
>>>>> what the next (type of) access check performed by the kernel will be.
>>>>>
>>>>> To say it another way, this chaining mechanism is a way to split the
>>>>> evaluation of a kernel object across multiple specialized programs,
>>>>> each of them able to deal with data tied to its type. Using a
>>>>> monolithic eBPF program to check everything does not scale and does not
>>>>> fit with unprivileged use either.
>>>>>
>>>>> As a side note, the cookie value is only an ephemeral value to keep
>>>>> state between multiple program calls. It can be used to create a state
>>>>> machine for an object evaluation.
>>>>>
>>>>> I don't see a way to do an efficient and programmatic path evaluation,
>>>>> with different access checks, with the current eBPF features. Please let
>>>>> me know if you know how to do it another way.
>>>>>
>>>>
>>>> Andy, Alexei, Daniel, what do you think about this Landlock program
>>>> chaining and cookie?
>>>>
>>>
>>> Can you give a small pseudocode real-world example that actually needs
>>> chaining?  The mechanism is quite complicated and I'd like to
>>> understand how it'll be used.
>>>
>>
>> Here is the interesting part from the example (patch 09/11):
>>
>> +SEC("maps")
>> +struct bpf_map_def inode_map = {
>> +	.type = BPF_MAP_TYPE_INODE,
>> +	.key_size = sizeof(u32),
>> +	.value_size = sizeof(u64),
>> +	.max_entries = 20,
>> +};
>> +
>> +SEC("subtype/landlock1")
>> +static union bpf_prog_subtype _subtype1 = {
>> +	.landlock_hook = {
>> +		.type = LANDLOCK_HOOK_FS_WALK,
>> +	}
>> +};
>> +
>> +static __always_inline __u64 update_cookie(__u64 cookie, __u8 lookup,
>> +		void *inode, void *chain, bool freeze)
>> +{
>> +	__u64 map_allow = 0;
>> +
>> +	if (cookie == 0) {
>> +		cookie = bpf_inode_get_tag(inode, chain);
>> +		if (cookie)
>> +			return cookie;
>> +		/* only look for the first match in the map, ignore nested
>> +		 * paths in this example */
>> +		map_allow = bpf_inode_map_lookup(&inode_map, inode);
>> +		if (map_allow)
>> +			cookie = 1 | map_allow;
>> +	} else {
>> +		if (cookie & COOKIE_VALUE_FREEZED)
>> +			return cookie;
>> +		map_allow = cookie & _MAP_MARK_MASK;
>> +		cookie &= ~_MAP_MARK_MASK;
>> +		switch (lookup) {
>> +		case LANDLOCK_CTX_FS_WALK_INODE_LOOKUP_DOTDOT:
>> +			cookie--;
>> +			break;
>> +		case LANDLOCK_CTX_FS_WALK_INODE_LOOKUP_DOT:
>> +			break;
>> +		default:
>> +			/* ignore _MAP_MARK_MASK overflow in this example */
>> +			cookie++;
>> +			break;
>> +		}
>> +		if (cookie >= 1)
>> +			cookie |= map_allow;
>> +	}
>> +	/* do not modify the cookie for each fs_pick */
>> +	if (freeze && cookie)
>> +		cookie |= COOKIE_VALUE_FREEZED;
>> +	return cookie;
>> +}
>> +
>> +SEC("landlock1")
>> +int fs_walk(struct landlock_ctx_fs_walk *ctx)
>> +{
>> +	ctx->cookie = update_cookie(ctx->cookie, ctx->inode_lookup,
>> +			(void *)ctx->inode, (void *)ctx->chain, false);
>> +	return LANDLOCK_RET_ALLOW;
>> +}
>>
>> The program "landlock1" is called for the execution (traversal) of every
>> directory (except the last one if it is the leaf of a path). This makes it
>> possible to identify a file hierarchy with only a (one-dimensional) list of
>> file descriptors (i.e. inode_map).
>>
>> Underneath, the Landlock LSM part checks whether there is a path walk
>> (nameidata) associated with each inode access request. If there is one,
>> then the cookie associated with the path walk (if any) is made available
>> through the eBPF program context. This makes it possible to develop a
>> state machine with an eBPF program to "evaluate" a file path (without
>> string parsing).
>>
>> The goal of this chaining mechanism is to be able to express a complex
>> kernel object like a file, with multiple runs of one or more eBPF
>> programs, as a multilayer evaluation. This semantic may only make sense
>> for the user/developer and their security policy. We must keep in mind
>> that this object identification should be available to unprivileged
>> processes. This means that we must be very careful about what kind of
>> information is available to an eBPF program, because it can then leak
>> to a process (e.g. through a map). With this mechanism, only information
>> already available to user space is available to the eBPF program.
>>
>> In this example, the complexity of the path evaluation is in the eBPF
>> program. We can then keep the kernel code simpler and more generic. This
>> allows more flexibility when defining a security policy.
> 
> it all sounds correct on paper, but it's a pretty novel
> approach and I'm not sure I see all the details in the patch.
> When people say "inode" they most of the time mean the inode number,
> whereas in this patch do you mean a raw pointer to the in-kernel
> 'struct inode' ?
> To avoid confusion it should probably be called differently.

It's indeed a pointer to a "struct inode", not an inode number.

I was thinking about generalizing BPF_MAP_TYPE_INODE by renaming it to
BPF_MAP_TYPE_FD. This map type could then be used to identify either a
set of inodes (pointers) or other kernel objects identifiable by a file
descriptor. A "subtype" (similar to the BPF prog subtype introduced in
this patch series) could be used to specialize such a map and statically
identify the kind of content it may hold. We could then add more
subtypes to identify sockets, devices, processes, and so on.
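To make that idea a bit more concrete, here is a purely hypothetical sketch
(none of the identifiers below exist in this series or in mainline): a
generalized FD-backed map declared with the same bpf_map_def convention as
the inode_map example quoted earlier.

/* Hypothetical sketch: BPF_MAP_TYPE_FD and map subtypes do not exist. */
SEC("maps")
struct bpf_map_def fd_map = {
	.type = BPF_MAP_TYPE_FD,	/* would generalize BPF_MAP_TYPE_INODE */
	.key_size = sizeof(__u32),	/* a file descriptor provided by user space */
	.value_size = sizeof(__u64),	/* user-defined flags, as in inode_map */
	.max_entries = 20,
};
/*
 * A map subtype, mirroring the program subtype of this series, could then
 * restrict such a map to one kind of FD-backed object: inodes, sockets,
 * devices, and so on.
 */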

> 
> If you meant inode as a number then why inode only?
> where is superblock, device, mount point?
> How can the bpf side compare inodes without this additional info?
> How will the bpf side know what inode to compare to?
> What if inode number is reused?

This pointer can identify whether a given inode is the same as one pointed
to by a file descriptor (or a file path).


> This approach is an optimization to compare inodes
> instead of strings passed into sys_open ?

Comparing paths as strings is less efficient, and it is also very
error-prone. Another advantage of using file descriptors is for
unprivileged processes: we can be sure that these processes are allowed
to access a file referred to by a file descriptor (an opened file). Indeed,
we check (with security_inode_getattr) that the process is allowed to stat
an opened file. This way, a malicious process can't infer information by
crafting path strings.


> 
> If you meant inode as a pointer, how will the bpf side
> know the pointer before the walk begins?

The BPF map is filled by user space with file descriptors pointing to
opened files. When a path walk begins, the LSM part of Landlock is
notified that a process is requesting access to the first element of
the path (e.g. "/"). This first element may or may not be part of a map.
The BPF program can then decide whether this request is legitimate.


> What guarantees that it's not a stale pointer?

When user space updates a map with a new file descriptor, the kernel
checks if this FD is valid. If this is the case, then the inode's usage
counter is incremented and its address is stored in the map.
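For what it is worth, a simplified kernel-side sketch of that validation
step, based only on the description above (the actual map-update helper in
the series may differ); it also includes the security_inode_getattr() check
mentioned earlier:

#include <linux/err.h>
#include <linux/file.h>		/* fdget(), fdput() */
#include <linux/fs.h>		/* file_inode(), igrab() */
#include <linux/security.h>	/* security_inode_getattr() */

/* Resolve and pin the inode behind a user-supplied file descriptor. */
static struct inode *get_pinned_inode(int fd)
{
	struct fd f;
	struct inode *inode;
	int err;

	f = fdget(fd);
	if (!f.file)
		return ERR_PTR(-EBADF);
	/* the caller must at least be allowed to stat the opened file */
	err = security_inode_getattr(&f.file->f_path);
	if (err) {
		fdput(f);
		return ERR_PTR(err);
	}
	/* bump the inode usage counter so the stored pointer cannot go stale */
	inode = igrab(file_inode(f.file));
	fdput(f);
	if (!inode)
		return ERR_PTR(-ENOENT);
	return inode;
}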

Patch

diff --git a/include/linux/landlock.h b/include/linux/landlock.h
new file mode 100644
index 000000000000..933d65c00075
--- /dev/null
+++ b/include/linux/landlock.h
@@ -0,0 +1,37 @@ 
+/*
+ * Landlock LSM - public kernel headers
+ *
+ * Copyright © 2016-2018 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018 ANSSI
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef _LINUX_LANDLOCK_H
+#define _LINUX_LANDLOCK_H
+
+#include <linux/errno.h>
+#include <linux/sched.h> /* task_struct */
+
+#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
+extern int landlock_seccomp_prepend_prog(unsigned int flags,
+		const int __user *user_bpf_fd);
+extern void put_seccomp_landlock(struct task_struct *tsk);
+extern void get_seccomp_landlock(struct task_struct *tsk);
+#else /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
+static inline int landlock_seccomp_prepend_prog(unsigned int flags,
+		const int __user *user_bpf_fd)
+{
+		return -EINVAL;
+}
+static inline void put_seccomp_landlock(struct task_struct *tsk)
+{
+}
+static inline void get_seccomp_landlock(struct task_struct *tsk)
+{
+}
+#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
+
+#endif /* _LINUX_LANDLOCK_H */
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index c723a5c4e3ff..dedad0d5b664 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -9,6 +9,7 @@ 
 
 #ifdef CONFIG_SECCOMP
 
+#include <linux/landlock.h>
 #include <linux/thread_info.h>
 #include <asm/seccomp.h>
 
@@ -20,6 +21,7 @@  struct seccomp_filter;
  *         system calls available to a process.
  * @filter: must always point to a valid seccomp-filter or NULL as it is
  *          accessed without locking during system call entry.
+ * @landlock_prog_set: contains a set of Landlock programs.
  *
  *          @filter must only be accessed from the context of current as there
  *          is no read locking.
@@ -27,6 +29,9 @@  struct seccomp_filter;
 struct seccomp {
 	int mode;
 	struct seccomp_filter *filter;
+#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
+	struct landlock_prog_set *landlock_prog_set;
+#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
 };
 
 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 2a0bd9dd104d..a4927638be82 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -15,6 +15,7 @@ 
 #define SECCOMP_SET_MODE_STRICT		0
 #define SECCOMP_SET_MODE_FILTER		1
 #define SECCOMP_GET_ACTION_AVAIL	2
+#define SECCOMP_PREPEND_LANDLOCK_PROG	3
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
 #define SECCOMP_FILTER_FLAG_TSYNC	1
diff --git a/kernel/fork.c b/kernel/fork.c
index be8aa5b98666..5a5f8cbbfcb9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -48,6 +48,7 @@ 
 #include <linux/security.h>
 #include <linux/hugetlb.h>
 #include <linux/seccomp.h>
+#include <linux/landlock.h>
 #include <linux/swap.h>
 #include <linux/syscalls.h>
 #include <linux/jiffies.h>
@@ -385,6 +386,7 @@  void free_task(struct task_struct *tsk)
 	rt_mutex_debug_task_free(tsk);
 	ftrace_graph_exit_task(tsk);
 	put_seccomp_filter(tsk);
+	put_seccomp_landlock(tsk);
 	arch_release_task_struct(tsk);
 	if (tsk->flags & PF_KTHREAD)
 		free_kthread_struct(tsk);
@@ -814,7 +816,10 @@  static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	 * the usage counts on the error path calling free_task.
 	 */
 	tsk->seccomp.filter = NULL;
-#endif
+#ifdef CONFIG_SECURITY_LANDLOCK
+	tsk->seccomp.landlock_prog_set = NULL;
+#endif /* CONFIG_SECURITY_LANDLOCK */
+#endif /* CONFIG_SECCOMP */
 
 	setup_thread_stack(tsk, orig);
 	clear_user_return_notifier(tsk);
@@ -1496,6 +1501,7 @@  static void copy_seccomp(struct task_struct *p)
 
 	/* Ref-count the new filter user, and assign it. */
 	get_seccomp_filter(current);
+	get_seccomp_landlock(current);
 	p->seccomp = current->seccomp;
 
 	/*
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 940fa408a288..47a37f6c0dcd 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -37,6 +37,7 @@ 
 #include <linux/security.h>
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
+#include <linux/landlock.h>
 
 /**
  * struct seccomp_filter - container for seccomp BPF programs
@@ -932,6 +933,9 @@  static long do_seccomp(unsigned int op, unsigned int flags,
 			return -EINVAL;
 
 		return seccomp_get_action_avail(uargs);
+	case SECCOMP_PREPEND_LANDLOCK_PROG:
+		return landlock_seccomp_prepend_prog(flags,
+				(const int __user *)uargs);
 	default:
 		return -EINVAL;
 	}
diff --git a/security/landlock/Makefile b/security/landlock/Makefile
index 7205f9a7a2ee..05fce359028e 100644
--- a/security/landlock/Makefile
+++ b/security/landlock/Makefile
@@ -1,3 +1,4 @@ 
 obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
 
-landlock-y := init.o
+landlock-y := init.o chain.o \
+	enforce.o enforce_seccomp.o
diff --git a/security/landlock/chain.c b/security/landlock/chain.c
new file mode 100644
index 000000000000..805f4cb60e7e
--- /dev/null
+++ b/security/landlock/chain.c
@@ -0,0 +1,39 @@ 
+/*
+ * Landlock LSM - chain helpers
+ *
+ * Copyright © 2018 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018 ANSSI
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/refcount.h>
+#include <linux/slab.h>
+
+#include "chain.h"
+
+/* TODO: use a dedicated kmem_cache_alloc() instead of k*alloc() */
+
+/* never return NULL */
+struct landlock_chain *landlock_new_chain(u8 index)
+{
+	struct landlock_chain *chain;
+
+	chain = kzalloc(sizeof(*chain), GFP_KERNEL);
+	if (!chain)
+		return ERR_PTR(-ENOMEM);
+	chain->index = index;
+	refcount_set(&chain->usage, 1);
+	return chain;
+}
+
+void landlock_put_chain(struct landlock_chain *chain)
+{
+	if (!chain)
+		return;
+	if (!refcount_dec_and_test(&chain->usage))
+		return;
+	kfree(chain);
+}
diff --git a/security/landlock/chain.h b/security/landlock/chain.h
new file mode 100644
index 000000000000..a1497ee779a6
--- /dev/null
+++ b/security/landlock/chain.h
@@ -0,0 +1,35 @@ 
+/*
+ * Landlock LSM - chain headers
+ *
+ * Copyright © 2018 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018 ANSSI
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef _SECURITY_LANDLOCK_CHAIN_H
+#define _SECURITY_LANDLOCK_CHAIN_H
+
+#include <linux/landlock.h> /* struct landlock_chain */
+#include <linux/refcount.h>
+
+/*
+ * @chain_index: index of the chain (defined by the user, different from a
+ *		 program list)
+ * @next: point to the next sibling in the same prog_set (used to match a chain
+ *	  against the current process)
+ * @index: index in the array dedicated to store data for a chain instance
+ */
+struct landlock_chain {
+	struct landlock_chain *next;
+	refcount_t usage;
+	u8 index;
+	u8 shared:1;
+};
+
+struct landlock_chain *landlock_new_chain(u8 index);
+void landlock_put_chain(struct landlock_chain *chain);
+
+#endif /* _SECURITY_LANDLOCK_CHAIN_H */
diff --git a/security/landlock/common.h b/security/landlock/common.h
index 0906678c0ed0..245e4ccafcf2 100644
--- a/security/landlock/common.h
+++ b/security/landlock/common.h
@@ -29,4 +29,57 @@ 
 #define _LANDLOCK_TRIGGER_FS_PICK_LAST	LANDLOCK_TRIGGER_FS_PICK_WRITE
 #define _LANDLOCK_TRIGGER_FS_PICK_MASK	((_LANDLOCK_TRIGGER_FS_PICK_LAST << 1ULL) - 1)
 
+struct landlock_chain;
+
+/*
+ * @is_last_of_type: in a chain of programs, it marks if this program is the
+ *		     last of its type
+ */
+struct landlock_prog_list {
+	struct landlock_prog_list *prev;
+	struct bpf_prog *prog;
+	struct landlock_chain *chain;
+	refcount_t usage;
+	u8 is_last_of_type:1;
+};
+
+/**
+ * struct landlock_prog_set - Landlock programs enforced on a thread
+ *
+ * This is used to keep the performance impact of forking a process low.
+ * Instead of copying the full array and incrementing the usage of each entry,
+ * only a pointer to &struct landlock_prog_set is copied and its usage is
+ * incremented. When prepending a new program, if &struct landlock_prog_set is
+ * shared with other tasks, then it is duplicated and the program is prepended
+ * to this new &struct landlock_prog_set.
+ *
+ * @usage: reference count to manage the object lifetime. When a thread needs
+ *	   to add Landlock programs and @usage is greater than 1, then the
+ *	   thread must duplicate &struct landlock_prog_set so that the
+ *	   children's programs are not changed as well.
+ * @chain_last: chain of the last prepended program
+ * @programs: array of non-NULL &struct landlock_prog_list pointers
+ */
+struct landlock_prog_set {
+	struct landlock_chain *chain_last;
+	struct landlock_prog_list *programs[_LANDLOCK_HOOK_LAST];
+	refcount_t usage;
+};
+
+/**
+ * get_index - get an index for the programs of struct landlock_prog_set
+ *
+ * @type: a Landlock hook type
+ */
+static inline int get_index(enum landlock_hook_type type)
+{
+	/* type ID > 0 for loaded programs */
+	return type - 1;
+}
+
+static inline enum landlock_hook_type get_type(struct bpf_prog *prog)
+{
+	return prog->aux->extra->subtype.landlock_hook.type;
+}
+
 #endif /* _SECURITY_LANDLOCK_COMMON_H */
diff --git a/security/landlock/enforce.c b/security/landlock/enforce.c
new file mode 100644
index 000000000000..8846cfd9aff7
--- /dev/null
+++ b/security/landlock/enforce.c
@@ -0,0 +1,386 @@ 
+/*
+ * Landlock LSM - enforcing helpers
+ *
+ * Copyright © 2016-2018 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018 ANSSI
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#include <asm/barrier.h> /* smp_store_release() */
+#include <asm/page.h> /* PAGE_SIZE */
+#include <linux/bpf.h> /* bpf_prog_put() */
+#include <linux/compiler.h> /* READ_ONCE() */
+#include <linux/err.h> /* PTR_ERR() */
+#include <linux/errno.h>
+#include <linux/filter.h> /* struct bpf_prog */
+#include <linux/refcount.h>
+#include <linux/slab.h> /* alloc(), kfree() */
+
+#include "chain.h"
+#include "common.h" /* struct landlock_prog_list */
+
+/* TODO: use a dedicated kmem_cache_alloc() instead of k*alloc() */
+
+static void put_landlock_prog_list(struct landlock_prog_list *prog_list)
+{
+	struct landlock_prog_list *orig = prog_list;
+
+	/* clean up single-reference branches iteratively */
+	while (orig && refcount_dec_and_test(&orig->usage)) {
+		struct landlock_prog_list *freeme = orig;
+
+		if (orig->prog)
+			bpf_prog_put(orig->prog);
+		landlock_put_chain(orig->chain);
+		orig = orig->prev;
+		kfree(freeme);
+	}
+}
+
+void landlock_put_prog_set(struct landlock_prog_set *prog_set)
+{
+	if (prog_set && refcount_dec_and_test(&prog_set->usage)) {
+		size_t i;
+
+		for (i = 0; i < ARRAY_SIZE(prog_set->programs); i++)
+			put_landlock_prog_list(prog_set->programs[i]);
+		landlock_put_chain(prog_set->chain_last);
+		kfree(prog_set);
+	}
+}
+
+void landlock_get_prog_set(struct landlock_prog_set *prog_set)
+{
+	struct landlock_chain *chain;
+
+	if (!prog_set)
+		return;
+	refcount_inc(&prog_set->usage);
+	chain = prog_set->chain_last;
+	/* mark all inherited chains as (potentially) shared */
+	while (chain && !chain->shared) {
+		chain->shared = 1;
+		chain = chain->next;
+	}
+}
+
+static struct landlock_prog_set *new_landlock_prog_set(void)
+{
+	struct landlock_prog_set *ret;
+
+	/* array filled with NULL values */
+	ret = kzalloc(sizeof(*ret), GFP_KERNEL);
+	if (!ret)
+		return ERR_PTR(-ENOMEM);
+	refcount_set(&ret->usage, 1);
+	return ret;
+}
+
+/*
+ * If a program type is able to fork, this means that one program amongst
+ * several (of different types) may be called afterwards, depending on the
+ * action type. This means that if a (sub)type has a "triggers" field (e.g.
+ * fs_pick), then it is forkable.
+ *
+ * Keep in sync with init.c:good_previous_prog().
+ */
+static bool is_hook_type_forkable(enum landlock_hook_type hook_type)
+{
+	switch (hook_type) {
+	case LANDLOCK_HOOK_FS_WALK:
+		return false;
+	case LANDLOCK_HOOK_FS_PICK:
+		/* can fork to fs_get or fs_ioctl... */
+		return true;
+	case LANDLOCK_HOOK_FS_GET:
+		return false;
+	}
+	WARN_ON(1);
+	return false;
+}
+
+/**
+ * store_landlock_prog - prepend and deduplicate a Landlock prog_list
+ *
+ * Prepend @prog to @init_prog_set while ignoring @prog and its chained programs
+ * if they are already in @ref_prog_set.  Whatever the result of this
+ * function call, you can call bpf_prog_put(@prog) afterwards.
+ *
+ * @init_prog_set: empty prog_set to prepend to
+ * @ref_prog_set: prog_set to check for duplicate programs
+ * @prog: program chain to prepend
+ *
+ * Return -errno on error or 0 if @prog was successfully stored.
+ */
+static int store_landlock_prog(struct landlock_prog_set *init_prog_set,
+		const struct landlock_prog_set *ref_prog_set,
+		struct bpf_prog *prog)
+{
+	struct landlock_prog_list *tmp_list = NULL;
+	int err;
+	u32 hook_idx;
+	bool new_is_last_of_type;
+	bool first = true;
+	struct landlock_chain *chain = NULL;
+	enum landlock_hook_type last_type;
+	struct bpf_prog *new = prog;
+
+	/* allocate all the memory we need */
+	for (; new; new = new->aux->extra->landlock_hook.previous) {
+		bool ignore = false;
+		struct landlock_prog_list *new_list;
+
+		new_is_last_of_type = first || (last_type != get_type(new));
+		last_type = get_type(new);
+		first = false;
+		/* ignore duplicate programs */
+		if (ref_prog_set) {
+			struct landlock_prog_list *ref;
+			struct bpf_prog *new_prev;
+
+			/*
+			 * The subtype verifier has already checked the
+			 * coherency of the program types chained in @new (cf.
+			 * good_previous_prog).
+			 *
+			 * Here we only allow linking to a chain if the common
+			 * program's type is able to fork (e.g. fs_pick) and
+			 * come from the same task (i.e. not shared).  This
+			 * program must also be the last one of its type in
+			 * both the @ref and the @new chains.  Finally, two
+			 * programs with the same parent must be of different
+			 * type.
+			 */
+			if (WARN_ON(!new->aux->extra))
+				continue;
+			new_prev = new->aux->extra->landlock_hook.previous;
+			hook_idx = get_index(get_type(new));
+			for (ref = ref_prog_set->programs[hook_idx];
+					ref; ref = ref->prev) {
+				struct bpf_prog *ref_prev;
+
+				ignore = (ref->prog == new);
+				if (ignore)
+					break;
+				ref_prev = ref->prog->aux->extra->
+					landlock_hook.previous;
+				/* deny fork to the same types */
+				if (new_prev && new_prev == ref_prev) {
+					err = -EINVAL;
+					goto put_tmp_list;
+				}
+			}
+			/* remaining programs are already in ref_prog_set */
+			if (ignore) {
+				bool is_forkable =
+					is_hook_type_forkable(get_type(new));
+
+				if (ref->chain->shared || !is_forkable ||
+						!new_is_last_of_type ||
+						!ref->is_last_of_type) {
+					err = -EINVAL;
+					goto put_tmp_list;
+				}
+				/* use the same session (i.e. cookie state) */
+				chain = ref->chain;
+				/* will increment the usage counter later */
+				break;
+			}
+		}
+
+		new = bpf_prog_inc(new);
+		if (IS_ERR(new)) {
+			err = PTR_ERR(new);
+			goto put_tmp_list;
+		}
+		new_list = kzalloc(sizeof(*new_list), GFP_KERNEL);
+		if (!new_list) {
+			bpf_prog_put(new);
+			err = -ENOMEM;
+			goto put_tmp_list;
+		}
+		/* ignore Landlock types in this tmp_list */
+		new_list->is_last_of_type = new_is_last_of_type;
+		new_list->prog = new;
+		new_list->prev = tmp_list;
+		refcount_set(&new_list->usage, 1);
+		tmp_list = new_list;
+	}
+
+	if (!tmp_list)
+		/* inform user space that this program was already added */
+		return -EEXIST;
+
+	if (!chain) {
+		u8 chain_index;
+
+		if (ref_prog_set) {
+			/* this is a new independent chain */
+			chain_index = ref_prog_set->chain_last->index + 1;
+			/* check for integer overflow */
+			if (chain_index < ref_prog_set->chain_last->index) {
+				err = -E2BIG;
+				goto put_tmp_list;
+			}
+		} else {
+			chain_index = 0;
+		}
+		chain = landlock_new_chain(chain_index);
+		if (IS_ERR(chain)) {
+			err = PTR_ERR(chain);
+			goto put_tmp_list;
+		}
+		/* no need to refcount_dec(&init_prog_set->chain_last) */
+	}
+	init_prog_set->chain_last = chain;
+
+	/* properly store the list (without error cases) */
+	while (tmp_list) {
+		struct landlock_prog_list *new_list;
+
+		new_list = tmp_list;
+		tmp_list = tmp_list->prev;
+		/* do not increment the previous prog list usage */
+		hook_idx = get_index(get_type(new_list->prog));
+		new_list->prev = init_prog_set->programs[hook_idx];
+		new_list->chain = chain;
+		refcount_inc(&chain->usage);
+		/* no need to add from the last program to the first because
+		 * each of them are a different Landlock type */
+		smp_store_release(&init_prog_set->programs[hook_idx], new_list);
+	}
+	return 0;
+
+put_tmp_list:
+	put_landlock_prog_list(tmp_list);
+	return err;
+}
+
+/* limit Landlock programs set to 256KB */
+#define LANDLOCK_PROGRAMS_MAX_PAGES (1 << 6)
+
+/**
+ * landlock_prepend_prog - attach a Landlock prog_list to @current_prog_set
+ *
+ * Whatever the result of this function call, you can call
+ * bpf_prog_put(@prog) afterwards.
+ *
+ * @current_prog_set: landlock_prog_set pointer, must be locked (if needed) to
+ *                    prevent a concurrent put/free. This pointer must not be
+ *                    freed after the call.
+ * @prog: non-NULL Landlock prog_list to prepend to @current_prog_set. @prog
+ *	  will be owned by landlock_prepend_prog() and freed if an error
+ *	  happened.
+ *
+ * Return @current_prog_set or a new pointer when OK. Return a pointer error
+ * otherwise.
+ */
+struct landlock_prog_set *landlock_prepend_prog(
+		struct landlock_prog_set *current_prog_set,
+		struct bpf_prog *prog)
+{
+	struct landlock_prog_set *new_prog_set = current_prog_set;
+	unsigned long pages;
+	int err;
+	size_t i;
+	struct landlock_prog_set tmp_prog_set = {};
+
+	if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
+		return ERR_PTR(-EINVAL);
+
+	/* validate memory size allocation */
+	pages = prog->pages;
+	if (current_prog_set) {
+		size_t i;
+
+		for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
+			struct landlock_prog_list *walker_p;
+
+			for (walker_p = current_prog_set->programs[i];
+					walker_p; walker_p = walker_p->prev)
+				pages += walker_p->prog->pages;
+		}
+		/* count a struct landlock_prog_set if we need to allocate one */
+		if (refcount_read(&current_prog_set->usage) != 1)
+			pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
+				/ PAGE_SIZE;
+	}
+	if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
+		return ERR_PTR(-E2BIG);
+
+	/* ensure early that we can allocate enough memory for the new
+	 * prog_lists */
+	err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
+	if (err)
+		return ERR_PTR(err);
+
+	/*
+	 * Each task_struct points to an array of prog list pointers.  These
+	 * tables are duplicated when additions are made (which means each
+	 * table needs to be refcounted for the processes using it). When a new
+	 * table is created, all the refcounters on the prog_list are bumped (to
+	 * track each table that references the prog). When a new prog is
+	 * added, it's just prepended to the list for the new table to point
+	 * at.
+	 *
+	 * Manage all the possible errors before this step to not uselessly
+	 * duplicate current_prog_set and avoid a rollback.
+	 */
+	if (!new_prog_set) {
+		/*
+		 * If there is no Landlock program set used by the current task,
+		 * then create a new one.
+		 */
+		new_prog_set = new_landlock_prog_set();
+		if (IS_ERR(new_prog_set))
+			goto put_tmp_lists;
+	} else if (refcount_read(&current_prog_set->usage) > 1) {
+		/*
+		 * If the current task is not the sole user of its Landlock
+		 * program set, then duplicate them.
+		 */
+		new_prog_set = new_landlock_prog_set();
+		if (IS_ERR(new_prog_set))
+			goto put_tmp_lists;
+		for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
+			new_prog_set->programs[i] =
+				READ_ONCE(current_prog_set->programs[i]);
+			if (new_prog_set->programs[i])
+				refcount_inc(&new_prog_set->programs[i]->usage);
+		}
+
+		/*
+		 * Landlock program set from the current task will not be freed
+		 * here because the usage is strictly greater than 1. It is
+		 * only prevented to be freed by another task thanks to the
+		 * caller of landlock_prepend_prog() which should be locked if
+		 * needed.
+		 */
+		landlock_put_prog_set(current_prog_set);
+	}
+
+	/* prepend tmp_prog_set to new_prog_set */
+	for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) {
+		/* get the last new list */
+		struct landlock_prog_list *last_list =
+			tmp_prog_set.programs[i];
+
+		if (last_list) {
+			while (last_list->prev)
+				last_list = last_list->prev;
+			/* no need to increment usage (pointer replacement) */
+			last_list->prev = new_prog_set->programs[i];
+			new_prog_set->programs[i] = tmp_prog_set.programs[i];
+		}
+	}
+	new_prog_set->chain_last = tmp_prog_set.chain_last;
+	return new_prog_set;
+
+put_tmp_lists:
+	for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++)
+		put_landlock_prog_list(tmp_prog_set.programs[i]);
+	return new_prog_set;
+}
diff --git a/security/landlock/enforce.h b/security/landlock/enforce.h
new file mode 100644
index 000000000000..27de15d4ca3e
--- /dev/null
+++ b/security/landlock/enforce.h
@@ -0,0 +1,21 @@ 
+/*
+ * Landlock LSM - enforcing helpers headers
+ *
+ * Copyright © 2016-2018 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018 ANSSI
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef _SECURITY_LANDLOCK_ENFORCE_H
+#define _SECURITY_LANDLOCK_ENFORCE_H
+
+struct landlock_prog_set *landlock_prepend_prog(
+		struct landlock_prog_set *current_prog_set,
+		struct bpf_prog *prog);
+void landlock_put_prog_set(struct landlock_prog_set *prog_set);
+void landlock_get_prog_set(struct landlock_prog_set *prog_set);
+
+#endif /* _SECURITY_LANDLOCK_ENFORCE_H */
diff --git a/security/landlock/enforce_seccomp.c b/security/landlock/enforce_seccomp.c
new file mode 100644
index 000000000000..8da72e868422
--- /dev/null
+++ b/security/landlock/enforce_seccomp.c
@@ -0,0 +1,102 @@ 
+/*
+ * Landlock LSM - enforcing with seccomp
+ *
+ * Copyright © 2016-2018 Mickaël Salaün <mic@digikod.net>
+ * Copyright © 2018 ANSSI
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#ifdef CONFIG_SECCOMP_FILTER
+
+#include <linux/bpf.h> /* bpf_prog_put() */
+#include <linux/capability.h>
+#include <linux/err.h> /* PTR_ERR() */
+#include <linux/errno.h>
+#include <linux/filter.h> /* struct bpf_prog */
+#include <linux/landlock.h>
+#include <linux/refcount.h>
+#include <linux/sched.h> /* current */
+#include <linux/uaccess.h> /* get_user() */
+
+#include "enforce.h"
+
+/* headers in include/linux/landlock.h */
+
+/**
+ * landlock_seccomp_prepend_prog - attach a Landlock program to the current
+ *                                 process
+ *
+ * current->seccomp.landlock_prog_set is lazily allocated. When a
+ * process forks, only a pointer is copied.  When a new program is added by a
+ * process, if there are other references to this process' prog_set, then a new
+ * allocation is made to contain an array pointing to Landlock program lists.
+ * This design keeps the performance impact low and is memory efficient while
+ * keeping the property of prepend-only programs.
+ *
+ * For now, installing a Landlock prog requires that the requesting task has
+ * the global CAP_SYS_ADMIN. We cannot force the use of no_new_privs to not
+ * exclude containers where a process may legitimately acquire more privileges
+ * thanks to an SUID binary.
+ *
+ * @flags: not used for now, but could be used for TSYNC
+ * @user_bpf_fd: file descriptor pointing to a loaded Landlock prog
+ */
+int landlock_seccomp_prepend_prog(unsigned int flags,
+		const int __user *user_bpf_fd)
+{
+	struct landlock_prog_set *new_prog_set;
+	struct bpf_prog *prog;
+	int bpf_fd, err;
+
+	/* planned to be replaced with a no_new_privs check to allow
+	 * unprivileged tasks */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	/* enable to check if Landlock is supported with early EFAULT */
+	if (!user_bpf_fd)
+		return -EFAULT;
+	if (flags)
+		return -EINVAL;
+	err = get_user(bpf_fd, user_bpf_fd);
+	if (err)
+		return err;
+
+	prog = bpf_prog_get(bpf_fd);
+	if (IS_ERR(prog)) {
+		err = PTR_ERR(prog);
+		goto free_task;
+	}
+
+	/*
+	 * We don't need to lock anything for the current process hierarchy,
+	 * everything is guarded by the atomic counters.
+	 */
+	new_prog_set = landlock_prepend_prog(
+			current->seccomp.landlock_prog_set, prog);
+	bpf_prog_put(prog);
+	/* @prog is managed/freed by landlock_prepend_prog() */
+	if (IS_ERR(new_prog_set)) {
+		err = PTR_ERR(new_prog_set);
+		goto free_task;
+	}
+	current->seccomp.landlock_prog_set = new_prog_set;
+	return 0;
+
+free_task:
+	return err;
+}
+
+void put_seccomp_landlock(struct task_struct *tsk)
+{
+	landlock_put_prog_set(tsk->seccomp.landlock_prog_set);
+}
+
+void get_seccomp_landlock(struct task_struct *tsk)
+{
+	landlock_get_prog_set(tsk->seccomp.landlock_prog_set);
+}
+
+#endif /* CONFIG_SECCOMP_FILTER */
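
For completeness, a minimal user-space sketch of how a loaded Landlock
program could be attached through the seccomp(2) operation added above. It
assumes prog_fd was returned by bpf(BPF_PROG_LOAD) for a
BPF_PROG_TYPE_LANDLOCK_HOOK program and that the caller has the required
(global) CAP_SYS_ADMIN; error handling is reduced to a minimum.

#include <sys/syscall.h>
#include <unistd.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_PREPEND_LANDLOCK_PROG
#define SECCOMP_PREPEND_LANDLOCK_PROG	3
#endif

/* Prepend a Landlock program to the current task (and its future children). */
static int landlock_enforce(int prog_fd)
{
	/* flags must be 0 for now; the program FD is passed by pointer */
	return syscall(__NR_seccomp, SECCOMP_PREPEND_LANDLOCK_PROG, 0, &prog_fd);
}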