diff mbox

[v5,06/10] seccomp,landlock: Handle Landlock events per process hierarchy

Message ID 20170222012632.4196-7-mic@digikod.net (mailing list archive)
State New, archived
Headers show

Commit Message

Mickaël Salaün Feb. 22, 2017, 1:26 a.m. UTC
The seccomp(2) syscall can be use to apply a Landlock rule to the
current process. As with a seccomp filter, the Landlock rule is enforced
for all its future children. An inherited rule tree can be updated
(append-only) by the owner of inherited Landlock nodes (e.g. a parent
process that create a new rule). However, an intermediate task, which
did not create a rule, will not be able to update its children's rules.

Landlock rules can be tied to a Landlock event. When such an event is
triggered, a tree of rules can be evaluated. Thisk kind of tree is
created with a first node.  This node reference a list of rules and an
optional parent node. Each rule return a 32-bit value which can
interrupt the evaluation with a non-zero value. If every rules returned
zero, the evaluation continues with the rule list of the parent node,
until the end of the tree.

Changes since v4:
* merge manager and seccomp patches
* return -EFAULT in seccomp(2) when user_bpf_fd is null to easely check
  if Landlock is supported
* only allow a process with the global CAP_SYS_ADMIN to use Landlock
  (will be lifted in the future)
* add an early check to exit as soon as possible if the current process
  does not have Landlock rules

Changes since v3:
* remove the hard link with seccomp (suggested by Andy Lutomirski and
  Kees Cook):
  * remove the cookie which could imply multiple evaluation of Landlock
    rules
  * remove the origin field in struct landlock_data
* remove documentation fix (merged upstream)
* rename the new seccomp command to SECCOMP_ADD_LANDLOCK_RULE
* internal renaming
* split commit
* new design to be able to inherit on the fly the parent rules

Changes since v2:
* Landlock programs can now be run without seccomp filter but for any
  syscall (from the process) or interruption
* move Landlock related functions and structs into security/landlock/*
  (to manage cgroups as well)
* fix seccomp filter handling: run Landlock programs for each of their
  legitimate seccomp filter
* properly clean up all seccomp results
* cosmetic changes to ease the understanding
* fix some ifdef

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Will Drewry <wad@chromium.org>
---
 include/linux/seccomp.h      |   8 ++
 include/uapi/linux/seccomp.h |   1 +
 kernel/fork.c                |  14 +-
 kernel/seccomp.c             |   8 ++
 security/landlock/Makefile   |   2 +-
 security/landlock/hooks.c    |  42 +++++-
 security/landlock/manager.c  | 321 +++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 392 insertions(+), 4 deletions(-)
 create mode 100644 security/landlock/manager.c

Comments

Andy Lutomirski Feb. 28, 2017, 8:01 p.m. UTC | #1
On Tue, Feb 21, 2017 at 5:26 PM, Mickaël Salaün <mic@digikod.net> wrote:
> The seccomp(2) syscall can be use to apply a Landlock rule to the
> current process. As with a seccomp filter, the Landlock rule is enforced
> for all its future children. An inherited rule tree can be updated
> (append-only) by the owner of inherited Landlock nodes (e.g. a parent
> process that create a new rule)

Can you clarify exaclty what this type of update does?  Is it
something that should be supported by normal seccomp rules as well?

> +/**
> + * landlock_run_prog - run Landlock program for a syscall

Unless this is actually specific to syscalls, s/for a syscall//, perhaps?

> +               if (new_events->nodes[event_idx]->owner ==
> +                               &new_events->nodes[event_idx]) {
> +                       /* We are the owner, we can then update the node. */
> +                       add_landlock_rule(new_events, rule);

This is the part I don't get.  Adding a rule if you're the owner (BTW,
why is ownership visible to userspace at all?) for just yourself and
future children is very different from adding it so it applies to
preexisting children too.


> +               } else if (atomic_read(&current_events->usage) == 1) {
> +                       WARN_ON(new_events->nodes[event_idx]->owner);
> +                       /*
> +                        * We can become the new owner if no other task use it.
> +                        * This avoid an unnecessary allocation.
> +                        */
> +                       new_events->nodes[event_idx]->owner =
> +                               &new_events->nodes[event_idx];
> +                       add_landlock_rule(new_events, rule);
> +               } else {
> +                       /*
> +                        * We are not the owner, we need to fork current_events
> +                        * and then add a new node.
> +                        */
> +                       struct landlock_node *node;
> +                       size_t i;
> +
> +                       node = kmalloc(sizeof(*node), GFP_KERNEL);
> +                       if (!node) {
> +                               new_events = ERR_PTR(-ENOMEM);
> +                               goto put_rule;
> +                       }
> +                       atomic_set(&node->usage, 1);
> +                       /* set the previous node after the new_events
> +                        * allocation */
> +                       node->prev = NULL;
> +                       /* do not increment the previous node usage */
> +                       node->owner = &new_events->nodes[event_idx];
> +                       /* rule->prev is already NULL */
> +                       atomic_set(&rule->usage, 1);
> +                       node->rule = rule;
> +
> +                       new_events = new_raw_landlock_events();
> +                       if (IS_ERR(new_events)) {
> +                               /* put the rule as well */
> +                               put_landlock_node(node);
> +                               return ERR_PTR(-ENOMEM);
> +                       }
> +                       for (i = 0; i < ARRAY_SIZE(new_events->nodes); i++) {
> +                               new_events->nodes[i] =
> +                                       lockless_dereference(
> +                                                       current_events->nodes[i]);
> +                               if (i == event_idx)
> +                                       node->prev = new_events->nodes[i];
> +                               if (!WARN_ON(!new_events->nodes[i]))
> +                                       atomic_inc(&new_events->nodes[i]->usage);
> +                       }
> +                       new_events->nodes[event_idx] = node;
> +
> +                       /*
> +                        * @current_events will not be freed here because it's usage
> +                        * field is > 1. It is only prevented to be freed by another
> +                        * subject thanks to the caller of landlock_append_prog() which
> +                        * should be locked if needed.
> +                        */
> +                       put_landlock_events(current_events);
> +               }
> +       }
> +       return new_events;
> +
> +put_prog:
> +       bpf_prog_put(prog);
> +       return new_events;
> +
> +put_rule:
> +       put_landlock_rule(rule);
> +       return new_events;
> +}
> +
> +/**
> + * landlock_seccomp_append_prog - attach a Landlock rule to the current process
> + *
> + * current->seccomp.landlock_events is lazily allocated. When a process fork,
> + * only a pointer is copied. When a new event is added by a process, if there
> + * is other references to this process' landlock_events, then a new allocation
> + * is made to contains an array pointing to Landlock rule lists. This design
> + * has low-performance impact and is memory efficient while keeping the
> + * property of append-only rules.
> + *
> + * @flags: not used for now, but could be used for TSYNC
> + * @user_bpf_fd: file descriptor pointing to a loaded Landlock rule
> + */
> +#ifdef CONFIG_SECCOMP_FILTER
> +int landlock_seccomp_append_prog(unsigned int flags, const char __user *user_bpf_fd)
> +{
> +       struct landlock_events *new_events;
> +       struct bpf_prog *prog;
> +       int bpf_fd;
> +
> +       /* force no_new_privs to limit privilege escalation */
> +       if (!task_no_new_privs(current))
> +               return -EPERM;
> +       /* will be removed in the future to allow unprivileged tasks */
> +       if (!capable(CAP_SYS_ADMIN))
> +               return -EPERM;
> +       if (!user_bpf_fd)
> +               return -EFAULT;
> +       if (flags)
> +               return -EINVAL;
> +       if (copy_from_user(&bpf_fd, user_bpf_fd, sizeof(bpf_fd)))
> +               return -EFAULT;
> +       prog = bpf_prog_get(bpf_fd);
> +       if (IS_ERR(prog))
> +               return PTR_ERR(prog);
> +
> +       /*
> +        * We don't need to lock anything for the current process hierarchy,
> +        * everything is guarded by the atomic counters.
> +        */
> +       new_events = landlock_append_prog(current->seccomp.landlock_events, prog);

Do you need to check that it's the right *kind* of bpf prog or is that
handled elsewhere?

--Andy
Mickaël Salaün March 1, 2017, 10:14 p.m. UTC | #2
On 28/02/2017 21:01, Andy Lutomirski wrote:
> On Tue, Feb 21, 2017 at 5:26 PM, Mickaël Salaün <mic@digikod.net> wrote:
>> The seccomp(2) syscall can be use to apply a Landlock rule to the
>> current process. As with a seccomp filter, the Landlock rule is enforced
>> for all its future children. An inherited rule tree can be updated
>> (append-only) by the owner of inherited Landlock nodes (e.g. a parent
>> process that create a new rule)
> 
> Can you clarify exaclty what this type of update does?  Is it
> something that should be supported by normal seccomp rules as well?

There is two main structures involved here: struct landlock_node and
struct landlock_rule, both defined in include/linux/landlock.h [02/10].

Let's take an example with seccomp filter and then Landlock:
* seccomp filter: Process P1 creates and applies a seccomp filter F1 to
itself. Then it forks and creates a child P2, which inherits P1's
filters, hence F1. Now, if P1 add a new seccomp filter F2 to itself, P2
*won't get it*. The P2's filter list will still only contains F1 but not
F2. If P2 sets up and applies a new filter F3 to itself, its filter list
will contains F1 and F3.
* Landlock: Process P1 creates and applies a Landlock rule R1 to itself.
Underneath the kernel creates a new node N1 dedicated to P1, which
contains all its rules. Then P1 forks and creates a child P2, which
inherits P1's rules, hence R1. Underneath P2 inherited N1. Now, if P1
add a new Landlock rule R2 to itself, P2 *will get it* as well (because
R2 is part of N1). If P2 creates and applies a new rule R3 to itself,
its rules will contains R1, R2 and R3. Underneath the kernel created a
new node N2 for P2, which only contains R3 but inherits/links to N1.

This design makes it possible for a process to add more constraints to
its children on the fly. I think it is a good feature to have and a
safer default inheritance mechanism, but it could be guarded by an
option flag if we want both mechanism to be available. The same design
could be used by seccomp filter too.


> 
>> +/**
>> + * landlock_run_prog - run Landlock program for a syscall
> 
> Unless this is actually specific to syscalls, s/for a syscall//, perhaps?

Right, not specific to syscall anymore.

> 
>> +               if (new_events->nodes[event_idx]->owner ==
>> +                               &new_events->nodes[event_idx]) {
>> +                       /* We are the owner, we can then update the node. */
>> +                       add_landlock_rule(new_events, rule);
> 
> This is the part I don't get.  Adding a rule if you're the owner (BTW,
> why is ownership visible to userspace at all?) for just yourself and
> future children is very different from adding it so it applies to
> preexisting children too.

Node ownership is not (directly) visible to userspace.

The current inheritance mechanism doesn't enable to only add a rule to
the current process. The rule will be inherited by its children
(starting from the children created after the first applied rule). An
option flag NEW_RULE_HIERARCHY (or maybe another seccomp operation)
could enable to create a new node for the current process, and then
makes it not inherited by the previous children.


> 
> 
>> +               } else if (atomic_read(&current_events->usage) == 1) {
>> +                       WARN_ON(new_events->nodes[event_idx]->owner);
>> +                       /*
>> +                        * We can become the new owner if no other task use it.
>> +                        * This avoid an unnecessary allocation.
>> +                        */
>> +                       new_events->nodes[event_idx]->owner =
>> +                               &new_events->nodes[event_idx];
>> +                       add_landlock_rule(new_events, rule);
>> +               } else {
>> +                       /*
>> +                        * We are not the owner, we need to fork current_events
>> +                        * and then add a new node.
>> +                        */
>> +                       struct landlock_node *node;
>> +                       size_t i;
>> +
>> +                       node = kmalloc(sizeof(*node), GFP_KERNEL);
>> +                       if (!node) {
>> +                               new_events = ERR_PTR(-ENOMEM);
>> +                               goto put_rule;
>> +                       }
>> +                       atomic_set(&node->usage, 1);
>> +                       /* set the previous node after the new_events
>> +                        * allocation */
>> +                       node->prev = NULL;
>> +                       /* do not increment the previous node usage */
>> +                       node->owner = &new_events->nodes[event_idx];
>> +                       /* rule->prev is already NULL */
>> +                       atomic_set(&rule->usage, 1);
>> +                       node->rule = rule;
>> +
>> +                       new_events = new_raw_landlock_events();
>> +                       if (IS_ERR(new_events)) {
>> +                               /* put the rule as well */
>> +                               put_landlock_node(node);
>> +                               return ERR_PTR(-ENOMEM);
>> +                       }
>> +                       for (i = 0; i < ARRAY_SIZE(new_events->nodes); i++) {
>> +                               new_events->nodes[i] =
>> +                                       lockless_dereference(
>> +                                                       current_events->nodes[i]);
>> +                               if (i == event_idx)
>> +                                       node->prev = new_events->nodes[i];
>> +                               if (!WARN_ON(!new_events->nodes[i]))
>> +                                       atomic_inc(&new_events->nodes[i]->usage);
>> +                       }
>> +                       new_events->nodes[event_idx] = node;
>> +
>> +                       /*
>> +                        * @current_events will not be freed here because it's usage
>> +                        * field is > 1. It is only prevented to be freed by another
>> +                        * subject thanks to the caller of landlock_append_prog() which
>> +                        * should be locked if needed.
>> +                        */
>> +                       put_landlock_events(current_events);
>> +               }
>> +       }
>> +       return new_events;
>> +
>> +put_prog:
>> +       bpf_prog_put(prog);
>> +       return new_events;
>> +
>> +put_rule:
>> +       put_landlock_rule(rule);
>> +       return new_events;
>> +}
>> +
>> +/**
>> + * landlock_seccomp_append_prog - attach a Landlock rule to the current process
>> + *
>> + * current->seccomp.landlock_events is lazily allocated. When a process fork,
>> + * only a pointer is copied. When a new event is added by a process, if there
>> + * is other references to this process' landlock_events, then a new allocation
>> + * is made to contains an array pointing to Landlock rule lists. This design
>> + * has low-performance impact and is memory efficient while keeping the
>> + * property of append-only rules.
>> + *
>> + * @flags: not used for now, but could be used for TSYNC
>> + * @user_bpf_fd: file descriptor pointing to a loaded Landlock rule
>> + */
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +int landlock_seccomp_append_prog(unsigned int flags, const char __user *user_bpf_fd)
>> +{
>> +       struct landlock_events *new_events;
>> +       struct bpf_prog *prog;
>> +       int bpf_fd;
>> +
>> +       /* force no_new_privs to limit privilege escalation */
>> +       if (!task_no_new_privs(current))
>> +               return -EPERM;
>> +       /* will be removed in the future to allow unprivileged tasks */
>> +       if (!capable(CAP_SYS_ADMIN))
>> +               return -EPERM;
>> +       if (!user_bpf_fd)
>> +               return -EFAULT;
>> +       if (flags)
>> +               return -EINVAL;
>> +       if (copy_from_user(&bpf_fd, user_bpf_fd, sizeof(bpf_fd)))
>> +               return -EFAULT;
>> +       prog = bpf_prog_get(bpf_fd);
>> +       if (IS_ERR(prog))
>> +               return PTR_ERR(prog);
>> +
>> +       /*
>> +        * We don't need to lock anything for the current process hierarchy,
>> +        * everything is guarded by the atomic counters.
>> +        */
>> +       new_events = landlock_append_prog(current->seccomp.landlock_events, prog);
> 
> Do you need to check that it's the right *kind* of bpf prog or is that
> handled elsewhere?

The program type is checked at the beginning of landlock_append_prog().

 Mickaël
Andy Lutomirski March 1, 2017, 10:20 p.m. UTC | #3
On Wed, Mar 1, 2017 at 2:14 PM, Mickaël Salaün <mic@digikod.net> wrote:
>
> On 28/02/2017 21:01, Andy Lutomirski wrote:
>> On Tue, Feb 21, 2017 at 5:26 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>> The seccomp(2) syscall can be use to apply a Landlock rule to the
>>> current process. As with a seccomp filter, the Landlock rule is enforced
>>> for all its future children. An inherited rule tree can be updated
>>> (append-only) by the owner of inherited Landlock nodes (e.g. a parent
>>> process that create a new rule)
>>
>> Can you clarify exaclty what this type of update does?  Is it
>> something that should be supported by normal seccomp rules as well?
>
> There is two main structures involved here: struct landlock_node and
> struct landlock_rule, both defined in include/linux/landlock.h [02/10].
>
> Let's take an example with seccomp filter and then Landlock:
> * seccomp filter: Process P1 creates and applies a seccomp filter F1 to
> itself. Then it forks and creates a child P2, which inherits P1's
> filters, hence F1. Now, if P1 add a new seccomp filter F2 to itself, P2
> *won't get it*. The P2's filter list will still only contains F1 but not
> F2. If P2 sets up and applies a new filter F3 to itself, its filter list
> will contains F1 and F3.
> * Landlock: Process P1 creates and applies a Landlock rule R1 to itself.
> Underneath the kernel creates a new node N1 dedicated to P1, which
> contains all its rules. Then P1 forks and creates a child P2, which
> inherits P1's rules, hence R1. Underneath P2 inherited N1. Now, if P1
> add a new Landlock rule R2 to itself, P2 *will get it* as well (because
> R2 is part of N1). If P2 creates and applies a new rule R3 to itself,
> its rules will contains R1, R2 and R3. Underneath the kernel created a
> new node N2 for P2, which only contains R3 but inherits/links to N1.
>
> This design makes it possible for a process to add more constraints to
> its children on the fly. I think it is a good feature to have and a
> safer default inheritance mechanism, but it could be guarded by an
> option flag if we want both mechanism to be available. The same design
> could be used by seccomp filter too.
>

Then let's do it right.

Currently each task has an array of seccomp filter layers.  When a
task forks, the child inherits the layers.  All the layers are
presently immutable.  With Landlock, a layer can logically be a
syscall fitler layer or a Landlock layer.  This fits in to the
existing model just fine.

If we want to have an interface to allow modification of an existing
layer, let's make it so that, when a layer is added, you have to
specify a flag to make the layer modifiable (by current, presumably,
although I can imagine other policies down the road).  Then have a
separate API that modifies a layer.

IOW, I think your patch is bad for three reasons, all fixable:

1. The default is wrong.  A layer should be immutable to avoid an easy
attack in which you try to sandbox *yourself* and then you just modify
the layer to weaken it.

2. The API that adds a layer should be different from the API that
modifies a layer.

3. The whole modification mechanism should be a separate patch to be
reviewed on its own merits.

> The current inheritance mechanism doesn't enable to only add a rule to
> the current process. The rule will be inherited by its children
> (starting from the children created after the first applied rule). An
> option flag NEW_RULE_HIERARCHY (or maybe another seccomp operation)
> could enable to create a new node for the current process, and then
> makes it not inherited by the previous children.

I like my proposal above much better.  "Add a layer" and "change a
layer" should be different operations.

--Andy
Mickaël Salaün March 1, 2017, 11:28 p.m. UTC | #4
On 01/03/2017 23:20, Andy Lutomirski wrote:
> On Wed, Mar 1, 2017 at 2:14 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>
>> On 28/02/2017 21:01, Andy Lutomirski wrote:
>>> On Tue, Feb 21, 2017 at 5:26 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>>> The seccomp(2) syscall can be use to apply a Landlock rule to the
>>>> current process. As with a seccomp filter, the Landlock rule is enforced
>>>> for all its future children. An inherited rule tree can be updated
>>>> (append-only) by the owner of inherited Landlock nodes (e.g. a parent
>>>> process that create a new rule)
>>>
>>> Can you clarify exaclty what this type of update does?  Is it
>>> something that should be supported by normal seccomp rules as well?
>>
>> There is two main structures involved here: struct landlock_node and
>> struct landlock_rule, both defined in include/linux/landlock.h [02/10].
>>
>> Let's take an example with seccomp filter and then Landlock:
>> * seccomp filter: Process P1 creates and applies a seccomp filter F1 to
>> itself. Then it forks and creates a child P2, which inherits P1's
>> filters, hence F1. Now, if P1 add a new seccomp filter F2 to itself, P2
>> *won't get it*. The P2's filter list will still only contains F1 but not
>> F2. If P2 sets up and applies a new filter F3 to itself, its filter list
>> will contains F1 and F3.
>> * Landlock: Process P1 creates and applies a Landlock rule R1 to itself.
>> Underneath the kernel creates a new node N1 dedicated to P1, which
>> contains all its rules. Then P1 forks and creates a child P2, which
>> inherits P1's rules, hence R1. Underneath P2 inherited N1. Now, if P1
>> add a new Landlock rule R2 to itself, P2 *will get it* as well (because
>> R2 is part of N1). If P2 creates and applies a new rule R3 to itself,
>> its rules will contains R1, R2 and R3. Underneath the kernel created a
>> new node N2 for P2, which only contains R3 but inherits/links to N1.
>>
>> This design makes it possible for a process to add more constraints to
>> its children on the fly. I think it is a good feature to have and a
>> safer default inheritance mechanism, but it could be guarded by an
>> option flag if we want both mechanism to be available. The same design
>> could be used by seccomp filter too.
>>
> 
> Then let's do it right.
> 
> Currently each task has an array of seccomp filter layers.  When a
> task forks, the child inherits the layers.  All the layers are
> presently immutable.  With Landlock, a layer can logically be a
> syscall fitler layer or a Landlock layer.  This fits in to the
> existing model just fine.
> 
> If we want to have an interface to allow modification of an existing
> layer, let's make it so that, when a layer is added, you have to
> specify a flag to make the layer modifiable (by current, presumably,
> although I can imagine other policies down the road).  Then have a
> separate API that modifies a layer.
> 
> IOW, I think your patch is bad for three reasons, all fixable:
> 
> 1. The default is wrong.  A layer should be immutable to avoid an easy
> attack in which you try to sandbox *yourself* and then you just modify
> the layer to weaken it.

This is not possible, there is only an operation for now:
SECCOMP_ADD_LANDLOCK_RULE. You can only add more rules to the list (as
for seccomp filter). There is no way to weaken a sandbox. The question
is: how do we want to handle the rules *tree* (from the kernel point of
view)?

> 
> 2. The API that adds a layer should be different from the API that
> modifies a layer.

Right, but it doesn't apply now because we can only add rules.

> 
> 3. The whole modification mechanism should be a separate patch to be
> reviewed on its own merits.

For a rule *replacement*, sure!

> 
>> The current inheritance mechanism doesn't enable to only add a rule to
>> the current process. The rule will be inherited by its children
>> (starting from the children created after the first applied rule). An
>> option flag NEW_RULE_HIERARCHY (or maybe another seccomp operation)
>> could enable to create a new node for the current process, and then
>> makes it not inherited by the previous children.
> 
> I like my proposal above much better.  "Add a layer" and "change a
> layer" should be different operations.

I agree, but for now it's about how to handle immutable (but growing)
inherited rules.
Djalal Harouni March 2, 2017, 10:22 a.m. UTC | #5
On Wed, Feb 22, 2017 at 2:26 AM, Mickaël Salaün <mic@digikod.net> wrote:
> The seccomp(2) syscall can be use to apply a Landlock rule to the
> current process. As with a seccomp filter, the Landlock rule is enforced
> for all its future children. An inherited rule tree can be updated
> (append-only) by the owner of inherited Landlock nodes (e.g. a parent
> process that create a new rule). However, an intermediate task, which
> did not create a rule, will not be able to update its children's rules.
>
> Landlock rules can be tied to a Landlock event. When such an event is
> triggered, a tree of rules can be evaluated. Thisk kind of tree is
> created with a first node.  This node reference a list of rules and an
> optional parent node. Each rule return a 32-bit value which can
> interrupt the evaluation with a non-zero value. If every rules returned
> zero, the evaluation continues with the rule list of the parent node,
> until the end of the tree.
>
> Changes since v4:
> * merge manager and seccomp patches
> * return -EFAULT in seccomp(2) when user_bpf_fd is null to easely check
>   if Landlock is supported
> * only allow a process with the global CAP_SYS_ADMIN to use Landlock
>   (will be lifted in the future)
> * add an early check to exit as soon as possible if the current process
>   does not have Landlock rules
>
> Changes since v3:
> * remove the hard link with seccomp (suggested by Andy Lutomirski and
>   Kees Cook):
>   * remove the cookie which could imply multiple evaluation of Landlock
>     rules
>   * remove the origin field in struct landlock_data
> * remove documentation fix (merged upstream)
> * rename the new seccomp command to SECCOMP_ADD_LANDLOCK_RULE
> * internal renaming
> * split commit
> * new design to be able to inherit on the fly the parent rules
>
> Changes since v2:
> * Landlock programs can now be run without seccomp filter but for any
>   syscall (from the process) or interruption
> * move Landlock related functions and structs into security/landlock/*
>   (to manage cgroups as well)
> * fix seccomp filter handling: run Landlock programs for each of their
>   legitimate seccomp filter
> * properly clean up all seccomp results
> * cosmetic changes to ease the understanding
> * fix some ifdef
>
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: James Morris <james.l.morris@oracle.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Serge E. Hallyn <serge@hallyn.com>
> Cc: Will Drewry <wad@chromium.org>
> ---
>  include/linux/seccomp.h      |   8 ++
>  include/uapi/linux/seccomp.h |   1 +
>  kernel/fork.c                |  14 +-
>  kernel/seccomp.c             |   8 ++
>  security/landlock/Makefile   |   2 +-
>  security/landlock/hooks.c    |  42 +++++-
>  security/landlock/manager.c  | 321 +++++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 392 insertions(+), 4 deletions(-)
>  create mode 100644 security/landlock/manager.c
>
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index e25aee2cdfc0..9a38de3c0e72 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -10,6 +10,10 @@
>  #include <linux/thread_info.h>
>  #include <asm/seccomp.h>
>
> +#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
> +struct landlock_events;
> +#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
> +
>  struct seccomp_filter;
>  /**
>   * struct seccomp - the state of a seccomp'ed process
> @@ -18,6 +22,7 @@ struct seccomp_filter;
>   *         system calls available to a process.
>   * @filter: must always point to a valid seccomp-filter or NULL as it is
>   *          accessed without locking during system call entry.
> + * @landlock_events: contains an array of Landlock rules.
>   *
>   *          @filter must only be accessed from the context of current as there
>   *          is no read locking.
> @@ -25,6 +30,9 @@ struct seccomp_filter;
>  struct seccomp {
>         int mode;
>         struct seccomp_filter *filter;
> +#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
> +       struct landlock_events *landlock_events;
> +#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
>  };
>
>  #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 0f238a43ff1e..56dd692cddac 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -13,6 +13,7 @@
>  /* Valid operations for seccomp syscall. */
>  #define SECCOMP_SET_MODE_STRICT        0
>  #define SECCOMP_SET_MODE_FILTER        1
> +#define SECCOMP_ADD_LANDLOCK_RULE      2
>
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
>  #define SECCOMP_FILTER_FLAG_TSYNC      1
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a4f0d0e8aeb2..bd5c72dffe60 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -37,6 +37,7 @@
>  #include <linux/security.h>
>  #include <linux/hugetlb.h>
>  #include <linux/seccomp.h>
> +#include <linux/landlock.h>
>  #include <linux/swap.h>
>  #include <linux/syscalls.h>
>  #include <linux/jiffies.h>
> @@ -515,7 +516,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>          * the usage counts on the error path calling free_task.
>          */
>         tsk->seccomp.filter = NULL;
> -#endif
> +#ifdef CONFIG_SECURITY_LANDLOCK
> +       tsk->seccomp.landlock_events = NULL;
> +#endif /* CONFIG_SECURITY_LANDLOCK */
> +#endif /* CONFIG_SECCOMP */
>
>         setup_thread_stack(tsk, orig);
>         clear_user_return_notifier(tsk);
> @@ -1388,7 +1392,13 @@ static void copy_seccomp(struct task_struct *p)
>
>         /* Ref-count the new filter user, and assign it. */
>         get_seccomp_filter(current);
> -       p->seccomp = current->seccomp;
> +       p->seccomp.mode = current->seccomp.mode;
> +       p->seccomp.filter = current->seccomp.filter;
> +#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
> +       p->seccomp.landlock_events = current->seccomp.landlock_events;
> +       if (p->seccomp.landlock_events)
> +               atomic_inc(&p->seccomp.landlock_events->usage);
> +#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
>
>         /*
>          * Explicitly enable no_new_privs here in case it got set
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 06f2f3ee454c..ef412d95ff5d 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -32,6 +32,7 @@
>  #include <linux/security.h>
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
> +#include <linux/landlock.h>
>
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
> @@ -492,6 +493,9 @@ static void put_seccomp_filter(struct seccomp_filter *filter)
>  void put_seccomp(struct task_struct *tsk)
>  {
>         put_seccomp_filter(tsk->seccomp.filter);
> +#ifdef CONFIG_SECURITY_LANDLOCK
> +       put_landlock_events(tsk->seccomp.landlock_events);
> +#endif /* CONFIG_SECURITY_LANDLOCK */
>  }
>
>  /**
> @@ -796,6 +800,10 @@ static long do_seccomp(unsigned int op, unsigned int flags,
>                 return seccomp_set_mode_strict();
>         case SECCOMP_SET_MODE_FILTER:
>                 return seccomp_set_mode_filter(flags, uargs);
> +#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
> +       case SECCOMP_ADD_LANDLOCK_RULE:
> +               return landlock_seccomp_append_prog(flags, uargs);
> +#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
>         default:
>                 return -EINVAL;
>         }
> diff --git a/security/landlock/Makefile b/security/landlock/Makefile
> index 8dc8bde660bd..6c1b0d8bd810 100644
> --- a/security/landlock/Makefile
> +++ b/security/landlock/Makefile
> @@ -2,4 +2,4 @@ ccflags-$(CONFIG_SECURITY_LANDLOCK) += -Werror=unused-function
>
>  obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
>
> -landlock-y := hooks.o
> +landlock-y := hooks.o manager.o
> diff --git a/security/landlock/hooks.c b/security/landlock/hooks.c
> index 88ebe3f01758..704ea18377d2 100644
> --- a/security/landlock/hooks.c
> +++ b/security/landlock/hooks.c
> @@ -290,7 +290,44 @@ static u64 mem_prot_to_access(unsigned long prot, bool private)
>
>  static inline bool landlock_used(void)
>  {
> +#ifdef CONFIG_SECCOMP_FILTER
> +       return !!(current->seccomp.landlock_events);
> +#else
>         return false;
> +#endif /* CONFIG_SECCOMP_FILTER */
> +}
> +
> +/**
> + * landlock_run_prog - run Landlock program for a syscall
> + *
> + * @event_idx: event index in the rules array
> + * @ctx: non-NULL eBPF context
> + * @events: Landlock events pointer
> + */
> +static int landlock_run_prog(u32 event_idx, const struct landlock_context *ctx,
> +               struct landlock_events *events)
> +{
> +       struct landlock_node *node;
> +
> +       if (!events)
> +               return 0;
> +
> +       for (node = events->nodes[event_idx]; node; node = node->prev) {
> +               struct landlock_rule *rule;
> +
> +               for (rule = node->rule; rule; rule = rule->prev) {
> +                       u32 ret;
> +
> +                       if (WARN_ON(!rule->prog))
> +                               continue;
> +                       rcu_read_lock();
> +                       ret = BPF_PROG_RUN(rule->prog, (void *)ctx);
> +                       rcu_read_unlock();
> +                       if (ret)
> +                               return -EPERM;
> +               }
> +       }
> +       return 0;
>  }
>
>  static int landlock_decide(enum landlock_subtype_event event,
> @@ -309,7 +346,10 @@ static int landlock_decide(enum landlock_subtype_event event,
>                 .arg2 = ctx_values[1],
>         };
>
> -       /* insert manager call here */
> +#ifdef CONFIG_SECCOMP_FILTER
> +       ret = landlock_run_prog(event_idx, &ctx,
> +                       current->seccomp.landlock_events);
> +#endif /* CONFIG_SECCOMP_FILTER */
>
>         return ret;
>  }
> diff --git a/security/landlock/manager.c b/security/landlock/manager.c
> new file mode 100644
> index 000000000000..00bb2944c85e
> --- /dev/null
> +++ b/security/landlock/manager.c
> @@ -0,0 +1,321 @@
> +/*
> + * Landlock LSM - seccomp manager
> + *
> + * Copyright © 2017 Mickaël Salaün <mic@digikod.net>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2, as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <asm/page.h> /* PAGE_SIZE */
> +#include <linux/atomic.h> /* atomic_*(), smp_store_release() */
> +#include <linux/bpf.h> /* bpf_prog_put() */
> +#include <linux/filter.h> /* struct bpf_prog */
> +#include <linux/kernel.h> /* round_up() */
> +#include <linux/landlock.h>
> +#include <linux/sched.h> /* current_cred(), task_no_new_privs() */
> +#include <linux/security.h> /* security_capable_noaudit() */
> +#include <linux/slab.h> /* alloc(), kfree() */
> +#include <linux/types.h> /* atomic_t */
> +#include <linux/uaccess.h> /* copy_from_user() */
> +
> +#include "common.h"
> +
> +static void put_landlock_rule(struct landlock_rule *rule)
> +{
> +       struct landlock_rule *orig = rule;
> +
> +       /* clean up single-reference branches iteratively */
> +       while (orig && atomic_dec_and_test(&orig->usage)) {
> +               struct landlock_rule *freeme = orig;
> +
> +               bpf_prog_put(orig->prog);
> +               orig = orig->prev;
> +               kfree(freeme);
> +       }
> +}
> +
> +static void put_landlock_node(struct landlock_node *node)
> +{
> +       struct landlock_node *orig = node;
> +
> +       /* clean up single-reference branches iteratively */
> +       while (orig && atomic_dec_and_test(&orig->usage)) {
> +               struct landlock_node *freeme = orig;
> +
> +               put_landlock_rule(orig->rule);
> +               orig = orig->prev;
> +               kfree(freeme);
> +       }
> +}
> +
> +void put_landlock_events(struct landlock_events *events)
> +{
> +       if (events && atomic_dec_and_test(&events->usage)) {
> +               size_t i;
> +
> +               /* XXX: Do we need to use lockless_dereference() here? */
> +               for (i = 0; i < ARRAY_SIZE(events->nodes); i++) {
> +                       if (!events->nodes[i])
> +                               continue;
> +                       /* Are we the owner of this node? */
> +                       if (events->nodes[i]->owner == &events->nodes[i])
> +                               events->nodes[i]->owner = NULL;
> +                       put_landlock_node(events->nodes[i]);
> +               }
> +               kfree(events);
> +       }
> +}
> +
> +static struct landlock_events *new_raw_landlock_events(void)
> +{
> +       struct landlock_events *ret;
> +
> +       /* array filled with NULL values */
> +       ret = kzalloc(sizeof(*ret), GFP_KERNEL);
> +       if (!ret)
> +               return ERR_PTR(-ENOMEM);
> +       atomic_set(&ret->usage, 1);
> +       return ret;
> +}
> +
> +static struct landlock_events *new_filled_landlock_events(void)
> +{
> +       size_t i;
> +       struct landlock_events *ret;
> +
> +       ret = new_raw_landlock_events();
> +       if (IS_ERR(ret))
> +               return ret;
> +       /*
> +        * We need to initially allocate every nodes to be able to update the
> +        * rules they are pointing to, across every (future) children of the
> +        * current task.
> +        */
> +       for (i = 0; i < ARRAY_SIZE(ret->nodes); i++) {
> +               struct landlock_node *node;
> +
> +               node = kzalloc(sizeof(*node), GFP_KERNEL);
> +               if (!node)
> +                       goto put_events;
> +               atomic_set(&node->usage, 1);
> +               /* we are the owner of this node */
> +               node->owner = &ret->nodes[i];
> +               ret->nodes[i] = node;
> +       }
> +       return ret;
> +
> +put_events:
> +       put_landlock_events(ret);
> +       return ERR_PTR(-ENOMEM);
> +}
> +
> +static void add_landlock_rule(struct landlock_events *events,
> +               struct landlock_rule *rule)
> +{
> +       /* subtype.landlock_rule.event > 0 for loaded programs */
> +       u32 event_idx = get_index(rule->prog->subtype.landlock_rule.event);
> +
> +       rule->prev = events->nodes[event_idx]->rule;
> +       WARN_ON(atomic_read(&rule->usage));
> +       atomic_set(&rule->usage, 1);
> +       /* do not increment the previous rule usage */
> +       smp_store_release(&events->nodes[event_idx]->rule, rule);
> +}
> +
> +/* Limit Landlock events to 256KB. */
> +#define LANDLOCK_EVENTS_MAX_PAGES (1 << 6)
> +
> +/**
> + * landlock_append_prog - attach a Landlock rule to @current_events
> + *
> + * @current_events: landlock_events pointer, must be locked (if needed) to
> + *                  prevent a concurrent put/free. This pointer must not be
> + *                  freed after the call.
> + * @prog: non-NULL Landlock rule to append to @current_events. @prog will be
> + *        owned by landlock_append_prog() and freed if an error happened.
> + *
> + * Return @current_events or a new pointer when OK. Return a pointer error
> + * otherwise.
> + */
> +static struct landlock_events *landlock_append_prog(
> +               struct landlock_events *current_events, struct bpf_prog *prog)
> +{
> +       struct landlock_events *new_events = current_events;
> +       unsigned long pages;
> +       struct landlock_rule *rule;
> +       u32 event_idx;
> +
> +       if (prog->type != BPF_PROG_TYPE_LANDLOCK) {
> +               new_events = ERR_PTR(-EINVAL);
> +               goto put_prog;
> +       }
> +
> +       /* validate memory size allocation */
> +       pages = prog->pages;
> +       if (current_events) {
> +               size_t i;
> +
> +               for (i = 0; i < ARRAY_SIZE(current_events->nodes); i++) {
> +                       struct landlock_node *walker_n;
> +
> +                       for (walker_n = current_events->nodes[i];
> +                                       walker_n;
> +                                       walker_n = walker_n->prev) {
> +                               struct landlock_rule *walker_r;
> +
> +                               for (walker_r = walker_n->rule;
> +                                               walker_r;
> +                                               walker_r = walker_r->prev)
> +                                       pages += walker_r->prog->pages;
> +                       }
> +               }
> +               /* count a struct landlock_events if we need to allocate one */
> +               if (atomic_read(&current_events->usage) != 1)
> +                       pages += round_up(sizeof(*current_events), PAGE_SIZE) /
> +                               PAGE_SIZE;
> +       }
> +       if (pages > LANDLOCK_EVENTS_MAX_PAGES) {
> +               new_events = ERR_PTR(-E2BIG);
> +               goto put_prog;
> +       }
> +
> +       rule = kzalloc(sizeof(*rule), GFP_KERNEL);
> +       if (!rule) {
> +               new_events = ERR_PTR(-ENOMEM);
> +               goto put_prog;
> +       }
> +       rule->prog = prog;
> +
> +       /* subtype.landlock_rule.event > 0 for loaded programs */
> +       event_idx = get_index(rule->prog->subtype.landlock_rule.event);
> +
> +       if (!current_events) {
> +               /* add a new landlock_events, if needed */
> +               new_events = new_filled_landlock_events();
> +               if (IS_ERR(new_events))
> +                       goto put_rule;
> +               add_landlock_rule(new_events, rule);
> +       } else {
> +               if (new_events->nodes[event_idx]->owner ==
> +                               &new_events->nodes[event_idx]) {
> +                       /* We are the owner, we can then update the node. */
> +                       add_landlock_rule(new_events, rule);
> +               } else if (atomic_read(&current_events->usage) == 1) {
> +                       WARN_ON(new_events->nodes[event_idx]->owner);
> +                       /*
> +                        * We can become the new owner if no other task use it.
> +                        * This avoid an unnecessary allocation.
> +                        */
> +                       new_events->nodes[event_idx]->owner =
> +                               &new_events->nodes[event_idx];
> +                       add_landlock_rule(new_events, rule);

I still don't get it all, but maybe here to make it simple you have to
always allocate, since the WARN_ON() suggests it should be scheduled
to be removed... and avoid to care about whether you are using/freeing
old rules of the dead task... ?


> +               } else {
> +                       /*
> +                        * We are not the owner, we need to fork current_events
> +                        * and then add a new node.
> +                        */
> +                       struct landlock_node *node;
> +                       size_t i;
> +
> +                       node = kmalloc(sizeof(*node), GFP_KERNEL);
> +                       if (!node) {
> +                               new_events = ERR_PTR(-ENOMEM);
> +                               goto put_rule;
> +                       }
> +                       atomic_set(&node->usage, 1);
> +                       /* set the previous node after the new_events
> +                        * allocation */
> +                       node->prev = NULL;
> +                       /* do not increment the previous node usage */
> +                       node->owner = &new_events->nodes[event_idx];
> +                       /* rule->prev is already NULL */
> +                       atomic_set(&rule->usage, 1);
> +                       node->rule = rule;
> +
> +                       new_events = new_raw_landlock_events();
> +                       if (IS_ERR(new_events)) {
> +                               /* put the rule as well */
> +                               put_landlock_node(node);
> +                               return ERR_PTR(-ENOMEM);
> +                       }
> +                       for (i = 0; i < ARRAY_SIZE(new_events->nodes); i++) {
> +                               new_events->nodes[i] =
> +                                       lockless_dereference(
> +                                                       current_events->nodes[i]);
> +                               if (i == event_idx)
> +                                       node->prev = new_events->nodes[i];
> +                               if (!WARN_ON(!new_events->nodes[i]))
> +                                       atomic_inc(&new_events->nodes[i]->usage);
> +                       }
> +                       new_events->nodes[event_idx] = node;
> +
> +                       /*
> +                        * @current_events will not be freed here because it's usage
> +                        * field is > 1. It is only prevented to be freed by another
> +                        * subject thanks to the caller of landlock_append_prog() which
> +                        * should be locked if needed.
> +                        */
> +                       put_landlock_events(current_events);
> +               }
> +       }
> +       return new_events;
> +
> +put_prog:
> +       bpf_prog_put(prog);
> +       return new_events;
> +
> +put_rule:
> +       put_landlock_rule(rule);
> +       return new_events;
> +}
> +
> +/**
> + * landlock_seccomp_append_prog - attach a Landlock rule to the current process
> + *
> + * current->seccomp.landlock_events is lazily allocated. When a process fork,
> + * only a pointer is copied. When a new event is added by a process, if there
> + * is other references to this process' landlock_events, then a new allocation
> + * is made to contains an array pointing to Landlock rule lists. This design
> + * has low-performance impact and is memory efficient while keeping the
> + * property of append-only rules.
> + *
> + * @flags: not used for now, but could be used for TSYNC
> + * @user_bpf_fd: file descriptor pointing to a loaded Landlock rule
> + */
> +#ifdef CONFIG_SECCOMP_FILTER
> +int landlock_seccomp_append_prog(unsigned int flags, const char __user *user_bpf_fd)
> +{
> +       struct landlock_events *new_events;
> +       struct bpf_prog *prog;
> +       int bpf_fd;
> +
> +       /* force no_new_privs to limit privilege escalation */
> +       if (!task_no_new_privs(current))
> +               return -EPERM;
> +       /* will be removed in the future to allow unprivileged tasks */
> +       if (!capable(CAP_SYS_ADMIN))
> +               return -EPERM;
> +       if (!user_bpf_fd)
> +               return -EFAULT;
> +       if (flags)
> +               return -EINVAL;
> +       if (copy_from_user(&bpf_fd, user_bpf_fd, sizeof(bpf_fd)))
> +               return -EFAULT;
> +       prog = bpf_prog_get(bpf_fd);
> +       if (IS_ERR(prog))
> +               return PTR_ERR(prog);
> +
> +       /*
> +        * We don't need to lock anything for the current process hierarchy,
> +        * everything is guarded by the atomic counters.
> +        */
> +       new_events = landlock_append_prog(current->seccomp.landlock_events, prog);
> +       /* @prog is managed/freed by landlock_append_prog() */
> +       if (IS_ERR(new_events))
> +               return PTR_ERR(new_events);
> +       current->seccomp.landlock_events = new_events;
> +       return 0;
> +}
> +#endif /* CONFIG_SECCOMP_FILTER */
> --
> 2.11.0
>
Andy Lutomirski March 2, 2017, 4:36 p.m. UTC | #6
On Wed, Mar 1, 2017 at 3:28 PM, Mickaël Salaün <mic@digikod.net> wrote:
>
>
> On 01/03/2017 23:20, Andy Lutomirski wrote:
>> On Wed, Mar 1, 2017 at 2:14 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>>
>>> On 28/02/2017 21:01, Andy Lutomirski wrote:
>>>> On Tue, Feb 21, 2017 at 5:26 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>> This design makes it possible for a process to add more constraints to
>>> its children on the fly. I think it is a good feature to have and a
>>> safer default inheritance mechanism, but it could be guarded by an
>>> option flag if we want both mechanism to be available. The same design
>>> could be used by seccomp filter too.
>>>
>>
>> Then let's do it right.
>>
>> Currently each task has an array of seccomp filter layers.  When a
>> task forks, the child inherits the layers.  All the layers are
>> presently immutable.  With Landlock, a layer can logically be a
>> syscall fitler layer or a Landlock layer.  This fits in to the
>> existing model just fine.
>>
>> If we want to have an interface to allow modification of an existing
>> layer, let's make it so that, when a layer is added, you have to
>> specify a flag to make the layer modifiable (by current, presumably,
>> although I can imagine other policies down the road).  Then have a
>> separate API that modifies a layer.
>>
>> IOW, I think your patch is bad for three reasons, all fixable:
>>
>> 1. The default is wrong.  A layer should be immutable to avoid an easy
>> attack in which you try to sandbox *yourself* and then you just modify
>> the layer to weaken it.
>
> This is not possible, there is only an operation for now:
> SECCOMP_ADD_LANDLOCK_RULE. You can only add more rules to the list (as
> for seccomp filter). There is no way to weaken a sandbox. The question
> is: how do we want to handle the rules *tree* (from the kernel point of
> view)?
>

Fair enough.  But I still think that immutability (like regular
seccomp) should be the default.  For security, simplicity is
important.  I guess there could be two ways to relax immutability:
allowing making the layer stricter and allowing any change at all.

As a default, though, programs should be able to expect that:

seccomp(SECCOMP_ADD_WHATEVER, ...);
fork();

[parent gets compromised]
[in parent]seccomp(anything whatsoever);

will not affect the child,  If the parent wants to relax that, that's
fine, but I think it should be explicit.

>>
>> 2. The API that adds a layer should be different from the API that
>> modifies a layer.
>
> Right, but it doesn't apply now because we can only add rules.

That's not what the code appears to do, though.  Sometimes it makes a
new layer without modifying tasks that share the layer and sometimes
it modifies the layer.

Both operations are probably okay, but they're not the same operation
and they shouldn't pretend to be.


>
>>
>> 3. The whole modification mechanism should be a separate patch to be
>> reviewed on its own merits.
>
> For a rule *replacement*, sure!

And for modification of policy for non-current tasks.  That's a big
departure from normal seccomp and should be reviewed as such.
Mickaël Salaün March 3, 2017, 12:48 a.m. UTC | #7
On 02/03/2017 17:36, Andy Lutomirski wrote:
> On Wed, Mar 1, 2017 at 3:28 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>
>>
>> On 01/03/2017 23:20, Andy Lutomirski wrote:
>>> On Wed, Mar 1, 2017 at 2:14 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>>>
>>>> On 28/02/2017 21:01, Andy Lutomirski wrote:
>>>>> On Tue, Feb 21, 2017 at 5:26 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>>> This design makes it possible for a process to add more constraints to
>>>> its children on the fly. I think it is a good feature to have and a
>>>> safer default inheritance mechanism, but it could be guarded by an
>>>> option flag if we want both mechanism to be available. The same design
>>>> could be used by seccomp filter too.
>>>>
>>>
>>> Then let's do it right.
>>>
>>> Currently each task has an array of seccomp filter layers.  When a
>>> task forks, the child inherits the layers.  All the layers are
>>> presently immutable.  With Landlock, a layer can logically be a
>>> syscall fitler layer or a Landlock layer.  This fits in to the
>>> existing model just fine.
>>>
>>> If we want to have an interface to allow modification of an existing
>>> layer, let's make it so that, when a layer is added, you have to
>>> specify a flag to make the layer modifiable (by current, presumably,
>>> although I can imagine other policies down the road).  Then have a
>>> separate API that modifies a layer.
>>>
>>> IOW, I think your patch is bad for three reasons, all fixable:
>>>
>>> 1. The default is wrong.  A layer should be immutable to avoid an easy
>>> attack in which you try to sandbox *yourself* and then you just modify
>>> the layer to weaken it.
>>
>> This is not possible, there is only an operation for now:
>> SECCOMP_ADD_LANDLOCK_RULE. You can only add more rules to the list (as
>> for seccomp filter). There is no way to weaken a sandbox. The question
>> is: how do we want to handle the rules *tree* (from the kernel point of
>> view)?
>>
> 
> Fair enough.  But I still think that immutability (like regular
> seccomp) should be the default.  For security, simplicity is
> important.  I guess there could be two ways to relax immutability:
> allowing making the layer stricter and allowing any change at all.
> 
> As a default, though, programs should be able to expect that:
> 
> seccomp(SECCOMP_ADD_WHATEVER, ...);
> fork();
> 
> [parent gets compromised]
> [in parent]seccomp(anything whatsoever);
> 
> will not affect the child,  If the parent wants to relax that, that's
> fine, but I think it should be explicit.

Good point. However the term "immutability" doesn't fit right because
the process is still allowed to add more rules to itself (as for
seccomp). The difference lays in the way a rule may be "appended" (by
the current process) or "inserted" (by a parent process).

I think three or four kind of operations (through the seccomp syscall)
make sense:
* append a rule (for the current process and its future children)
* add a node (insert point), from which the inserted rules will be tied
* insert a rule in the node, which will be inherited by futures children
* (maybe a "lock" command to make a layer immutable for the current
process and its children)

Doing so, a process is only allowed to insert a rule if a node was
previously added. To forbid itself to insert new rules to one of its
children, a process just need to not add a node before forking. Like
this, there is no need for special rule flags nor default behavior,
everything is explicit.

For this series, I will stick to the same behavior as seccomp filter:
only append rules to the current process (and its future children).


>>> 2. The API that adds a layer should be different from the API that
>>> modifies a layer.
>>
>> Right, but it doesn't apply now because we can only add rules.
> 
> That's not what the code appears to do, though.  Sometimes it makes a
> new layer without modifying tasks that share the layer and sometimes
> it modifies the layer.
> 
> Both operations are probably okay, but they're not the same operation
> and they shouldn't pretend to be.

It should be OK with my previous proposal. The other details could be
discussed in a separate future patch series.


>>> 3. The whole modification mechanism should be a separate patch to be
>>> reviewed on its own merits.
>>
>> For a rule *replacement*, sure!
> 
> And for modification of policy for non-current tasks.  That's a big
> departure from normal seccomp and should be reviewed as such.

Agreed
Mickaël Salaün March 3, 2017, 12:54 a.m. UTC | #8
On 02/03/2017 11:22, Djalal Harouni wrote:
> On Wed, Feb 22, 2017 at 2:26 AM, Mickaël Salaün <mic@digikod.net> wrote:
>> The seccomp(2) syscall can be use to apply a Landlock rule to the
>> current process. As with a seccomp filter, the Landlock rule is enforced
>> for all its future children. An inherited rule tree can be updated
>> (append-only) by the owner of inherited Landlock nodes (e.g. a parent
>> process that create a new rule). However, an intermediate task, which
>> did not create a rule, will not be able to update its children's rules.
>>
>> Landlock rules can be tied to a Landlock event. When such an event is
>> triggered, a tree of rules can be evaluated. Thisk kind of tree is
>> created with a first node.  This node reference a list of rules and an
>> optional parent node. Each rule return a 32-bit value which can
>> interrupt the evaluation with a non-zero value. If every rules returned
>> zero, the evaluation continues with the rule list of the parent node,
>> until the end of the tree.
>>
>> Changes since v4:
>> * merge manager and seccomp patches
>> * return -EFAULT in seccomp(2) when user_bpf_fd is null to easely check
>>   if Landlock is supported
>> * only allow a process with the global CAP_SYS_ADMIN to use Landlock
>>   (will be lifted in the future)
>> * add an early check to exit as soon as possible if the current process
>>   does not have Landlock rules
>>
>> Changes since v3:
>> * remove the hard link with seccomp (suggested by Andy Lutomirski and
>>   Kees Cook):
>>   * remove the cookie which could imply multiple evaluation of Landlock
>>     rules
>>   * remove the origin field in struct landlock_data
>> * remove documentation fix (merged upstream)
>> * rename the new seccomp command to SECCOMP_ADD_LANDLOCK_RULE
>> * internal renaming
>> * split commit
>> * new design to be able to inherit on the fly the parent rules
>>
>> Changes since v2:
>> * Landlock programs can now be run without seccomp filter but for any
>>   syscall (from the process) or interruption
>> * move Landlock related functions and structs into security/landlock/*
>>   (to manage cgroups as well)
>> * fix seccomp filter handling: run Landlock programs for each of their
>>   legitimate seccomp filter
>> * properly clean up all seccomp results
>> * cosmetic changes to ease the understanding
>> * fix some ifdef
>>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: James Morris <james.l.morris@oracle.com>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Serge E. Hallyn <serge@hallyn.com>
>> Cc: Will Drewry <wad@chromium.org>
>> ---
>>  include/linux/seccomp.h      |   8 ++
>>  include/uapi/linux/seccomp.h |   1 +
>>  kernel/fork.c                |  14 +-
>>  kernel/seccomp.c             |   8 ++
>>  security/landlock/Makefile   |   2 +-
>>  security/landlock/hooks.c    |  42 +++++-
>>  security/landlock/manager.c  | 321 +++++++++++++++++++++++++++++++++++++++++++
>>  7 files changed, 392 insertions(+), 4 deletions(-)
>>  create mode 100644 security/landlock/manager.c
>>
>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> index e25aee2cdfc0..9a38de3c0e72 100644
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -10,6 +10,10 @@
>>  #include <linux/thread_info.h>
>>  #include <asm/seccomp.h>
>>
>> +#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
>> +struct landlock_events;
>> +#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
>> +
>>  struct seccomp_filter;
>>  /**
>>   * struct seccomp - the state of a seccomp'ed process
>> @@ -18,6 +22,7 @@ struct seccomp_filter;
>>   *         system calls available to a process.
>>   * @filter: must always point to a valid seccomp-filter or NULL as it is
>>   *          accessed without locking during system call entry.
>> + * @landlock_events: contains an array of Landlock rules.
>>   *
>>   *          @filter must only be accessed from the context of current as there
>>   *          is no read locking.
>> @@ -25,6 +30,9 @@ struct seccomp_filter;
>>  struct seccomp {
>>         int mode;
>>         struct seccomp_filter *filter;
>> +#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
>> +       struct landlock_events *landlock_events;
>> +#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
>>  };
>>
>>  #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
>> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
>> index 0f238a43ff1e..56dd692cddac 100644
>> --- a/include/uapi/linux/seccomp.h
>> +++ b/include/uapi/linux/seccomp.h
>> @@ -13,6 +13,7 @@
>>  /* Valid operations for seccomp syscall. */
>>  #define SECCOMP_SET_MODE_STRICT        0
>>  #define SECCOMP_SET_MODE_FILTER        1
>> +#define SECCOMP_ADD_LANDLOCK_RULE      2
>>
>>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
>>  #define SECCOMP_FILTER_FLAG_TSYNC      1
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index a4f0d0e8aeb2..bd5c72dffe60 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -37,6 +37,7 @@
>>  #include <linux/security.h>
>>  #include <linux/hugetlb.h>
>>  #include <linux/seccomp.h>
>> +#include <linux/landlock.h>
>>  #include <linux/swap.h>
>>  #include <linux/syscalls.h>
>>  #include <linux/jiffies.h>
>> @@ -515,7 +516,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>>          * the usage counts on the error path calling free_task.
>>          */
>>         tsk->seccomp.filter = NULL;
>> -#endif
>> +#ifdef CONFIG_SECURITY_LANDLOCK
>> +       tsk->seccomp.landlock_events = NULL;
>> +#endif /* CONFIG_SECURITY_LANDLOCK */
>> +#endif /* CONFIG_SECCOMP */
>>
>>         setup_thread_stack(tsk, orig);
>>         clear_user_return_notifier(tsk);
>> @@ -1388,7 +1392,13 @@ static void copy_seccomp(struct task_struct *p)
>>
>>         /* Ref-count the new filter user, and assign it. */
>>         get_seccomp_filter(current);
>> -       p->seccomp = current->seccomp;
>> +       p->seccomp.mode = current->seccomp.mode;
>> +       p->seccomp.filter = current->seccomp.filter;
>> +#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
>> +       p->seccomp.landlock_events = current->seccomp.landlock_events;
>> +       if (p->seccomp.landlock_events)
>> +               atomic_inc(&p->seccomp.landlock_events->usage);
>> +#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
>>
>>         /*
>>          * Explicitly enable no_new_privs here in case it got set
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index 06f2f3ee454c..ef412d95ff5d 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -32,6 +32,7 @@
>>  #include <linux/security.h>
>>  #include <linux/tracehook.h>
>>  #include <linux/uaccess.h>
>> +#include <linux/landlock.h>
>>
>>  /**
>>   * struct seccomp_filter - container for seccomp BPF programs
>> @@ -492,6 +493,9 @@ static void put_seccomp_filter(struct seccomp_filter *filter)
>>  void put_seccomp(struct task_struct *tsk)
>>  {
>>         put_seccomp_filter(tsk->seccomp.filter);
>> +#ifdef CONFIG_SECURITY_LANDLOCK
>> +       put_landlock_events(tsk->seccomp.landlock_events);
>> +#endif /* CONFIG_SECURITY_LANDLOCK */
>>  }
>>
>>  /**
>> @@ -796,6 +800,10 @@ static long do_seccomp(unsigned int op, unsigned int flags,
>>                 return seccomp_set_mode_strict();
>>         case SECCOMP_SET_MODE_FILTER:
>>                 return seccomp_set_mode_filter(flags, uargs);
>> +#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
>> +       case SECCOMP_ADD_LANDLOCK_RULE:
>> +               return landlock_seccomp_append_prog(flags, uargs);
>> +#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
>>         default:
>>                 return -EINVAL;
>>         }
>> diff --git a/security/landlock/Makefile b/security/landlock/Makefile
>> index 8dc8bde660bd..6c1b0d8bd810 100644
>> --- a/security/landlock/Makefile
>> +++ b/security/landlock/Makefile
>> @@ -2,4 +2,4 @@ ccflags-$(CONFIG_SECURITY_LANDLOCK) += -Werror=unused-function
>>
>>  obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
>>
>> -landlock-y := hooks.o
>> +landlock-y := hooks.o manager.o
>> diff --git a/security/landlock/hooks.c b/security/landlock/hooks.c
>> index 88ebe3f01758..704ea18377d2 100644
>> --- a/security/landlock/hooks.c
>> +++ b/security/landlock/hooks.c
>> @@ -290,7 +290,44 @@ static u64 mem_prot_to_access(unsigned long prot, bool private)
>>
>>  static inline bool landlock_used(void)
>>  {
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +       return !!(current->seccomp.landlock_events);
>> +#else
>>         return false;
>> +#endif /* CONFIG_SECCOMP_FILTER */
>> +}
>> +
>> +/**
>> + * landlock_run_prog - run Landlock program for a syscall
>> + *
>> + * @event_idx: event index in the rules array
>> + * @ctx: non-NULL eBPF context
>> + * @events: Landlock events pointer
>> + */
>> +static int landlock_run_prog(u32 event_idx, const struct landlock_context *ctx,
>> +               struct landlock_events *events)
>> +{
>> +       struct landlock_node *node;
>> +
>> +       if (!events)
>> +               return 0;
>> +
>> +       for (node = events->nodes[event_idx]; node; node = node->prev) {
>> +               struct landlock_rule *rule;
>> +
>> +               for (rule = node->rule; rule; rule = rule->prev) {
>> +                       u32 ret;
>> +
>> +                       if (WARN_ON(!rule->prog))
>> +                               continue;
>> +                       rcu_read_lock();
>> +                       ret = BPF_PROG_RUN(rule->prog, (void *)ctx);
>> +                       rcu_read_unlock();
>> +                       if (ret)
>> +                               return -EPERM;
>> +               }
>> +       }
>> +       return 0;
>>  }
>>
>>  static int landlock_decide(enum landlock_subtype_event event,
>> @@ -309,7 +346,10 @@ static int landlock_decide(enum landlock_subtype_event event,
>>                 .arg2 = ctx_values[1],
>>         };
>>
>> -       /* insert manager call here */
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +       ret = landlock_run_prog(event_idx, &ctx,
>> +                       current->seccomp.landlock_events);
>> +#endif /* CONFIG_SECCOMP_FILTER */
>>
>>         return ret;
>>  }
>> diff --git a/security/landlock/manager.c b/security/landlock/manager.c
>> new file mode 100644
>> index 000000000000..00bb2944c85e
>> --- /dev/null
>> +++ b/security/landlock/manager.c
>> @@ -0,0 +1,321 @@
>> +/*
>> + * Landlock LSM - seccomp manager
>> + *
>> + * Copyright © 2017 Mickaël Salaün <mic@digikod.net>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2, as
>> + * published by the Free Software Foundation.
>> + */
>> +
>> +#include <asm/page.h> /* PAGE_SIZE */
>> +#include <linux/atomic.h> /* atomic_*(), smp_store_release() */
>> +#include <linux/bpf.h> /* bpf_prog_put() */
>> +#include <linux/filter.h> /* struct bpf_prog */
>> +#include <linux/kernel.h> /* round_up() */
>> +#include <linux/landlock.h>
>> +#include <linux/sched.h> /* current_cred(), task_no_new_privs() */
>> +#include <linux/security.h> /* security_capable_noaudit() */
>> +#include <linux/slab.h> /* alloc(), kfree() */
>> +#include <linux/types.h> /* atomic_t */
>> +#include <linux/uaccess.h> /* copy_from_user() */
>> +
>> +#include "common.h"
>> +
>> +static void put_landlock_rule(struct landlock_rule *rule)
>> +{
>> +       struct landlock_rule *orig = rule;
>> +
>> +       /* clean up single-reference branches iteratively */
>> +       while (orig && atomic_dec_and_test(&orig->usage)) {
>> +               struct landlock_rule *freeme = orig;
>> +
>> +               bpf_prog_put(orig->prog);
>> +               orig = orig->prev;
>> +               kfree(freeme);
>> +       }
>> +}
>> +
>> +static void put_landlock_node(struct landlock_node *node)
>> +{
>> +       struct landlock_node *orig = node;
>> +
>> +       /* clean up single-reference branches iteratively */
>> +       while (orig && atomic_dec_and_test(&orig->usage)) {
>> +               struct landlock_node *freeme = orig;
>> +
>> +               put_landlock_rule(orig->rule);
>> +               orig = orig->prev;
>> +               kfree(freeme);
>> +       }
>> +}
>> +
>> +void put_landlock_events(struct landlock_events *events)
>> +{
>> +       if (events && atomic_dec_and_test(&events->usage)) {
>> +               size_t i;
>> +
>> +               /* XXX: Do we need to use lockless_dereference() here? */
>> +               for (i = 0; i < ARRAY_SIZE(events->nodes); i++) {
>> +                       if (!events->nodes[i])
>> +                               continue;
>> +                       /* Are we the owner of this node? */
>> +                       if (events->nodes[i]->owner == &events->nodes[i])
>> +                               events->nodes[i]->owner = NULL;
>> +                       put_landlock_node(events->nodes[i]);
>> +               }
>> +               kfree(events);
>> +       }
>> +}
>> +
>> +static struct landlock_events *new_raw_landlock_events(void)
>> +{
>> +       struct landlock_events *ret;
>> +
>> +       /* array filled with NULL values */
>> +       ret = kzalloc(sizeof(*ret), GFP_KERNEL);
>> +       if (!ret)
>> +               return ERR_PTR(-ENOMEM);
>> +       atomic_set(&ret->usage, 1);
>> +       return ret;
>> +}
>> +
>> +static struct landlock_events *new_filled_landlock_events(void)
>> +{
>> +       size_t i;
>> +       struct landlock_events *ret;
>> +
>> +       ret = new_raw_landlock_events();
>> +       if (IS_ERR(ret))
>> +               return ret;
>> +       /*
>> +        * We need to initially allocate every nodes to be able to update the
>> +        * rules they are pointing to, across every (future) children of the
>> +        * current task.
>> +        */
>> +       for (i = 0; i < ARRAY_SIZE(ret->nodes); i++) {
>> +               struct landlock_node *node;
>> +
>> +               node = kzalloc(sizeof(*node), GFP_KERNEL);
>> +               if (!node)
>> +                       goto put_events;
>> +               atomic_set(&node->usage, 1);
>> +               /* we are the owner of this node */
>> +               node->owner = &ret->nodes[i];
>> +               ret->nodes[i] = node;
>> +       }
>> +       return ret;
>> +
>> +put_events:
>> +       put_landlock_events(ret);
>> +       return ERR_PTR(-ENOMEM);
>> +}
>> +
>> +static void add_landlock_rule(struct landlock_events *events,
>> +               struct landlock_rule *rule)
>> +{
>> +       /* subtype.landlock_rule.event > 0 for loaded programs */
>> +       u32 event_idx = get_index(rule->prog->subtype.landlock_rule.event);
>> +
>> +       rule->prev = events->nodes[event_idx]->rule;
>> +       WARN_ON(atomic_read(&rule->usage));
>> +       atomic_set(&rule->usage, 1);
>> +       /* do not increment the previous rule usage */
>> +       smp_store_release(&events->nodes[event_idx]->rule, rule);
>> +}
>> +
>> +/* Limit Landlock events to 256KB. */
>> +#define LANDLOCK_EVENTS_MAX_PAGES (1 << 6)
>> +
>> +/**
>> + * landlock_append_prog - attach a Landlock rule to @current_events
>> + *
>> + * @current_events: landlock_events pointer, must be locked (if needed) to
>> + *                  prevent a concurrent put/free. This pointer must not be
>> + *                  freed after the call.
>> + * @prog: non-NULL Landlock rule to append to @current_events. @prog will be
>> + *        owned by landlock_append_prog() and freed if an error happened.
>> + *
>> + * Return @current_events or a new pointer when OK. Return a pointer error
>> + * otherwise.
>> + */
>> +static struct landlock_events *landlock_append_prog(
>> +               struct landlock_events *current_events, struct bpf_prog *prog)
>> +{
>> +       struct landlock_events *new_events = current_events;
>> +       unsigned long pages;
>> +       struct landlock_rule *rule;
>> +       u32 event_idx;
>> +
>> +       if (prog->type != BPF_PROG_TYPE_LANDLOCK) {
>> +               new_events = ERR_PTR(-EINVAL);
>> +               goto put_prog;
>> +       }
>> +
>> +       /* validate memory size allocation */
>> +       pages = prog->pages;
>> +       if (current_events) {
>> +               size_t i;
>> +
>> +               for (i = 0; i < ARRAY_SIZE(current_events->nodes); i++) {
>> +                       struct landlock_node *walker_n;
>> +
>> +                       for (walker_n = current_events->nodes[i];
>> +                                       walker_n;
>> +                                       walker_n = walker_n->prev) {
>> +                               struct landlock_rule *walker_r;
>> +
>> +                               for (walker_r = walker_n->rule;
>> +                                               walker_r;
>> +                                               walker_r = walker_r->prev)
>> +                                       pages += walker_r->prog->pages;
>> +                       }
>> +               }
>> +               /* count a struct landlock_events if we need to allocate one */
>> +               if (atomic_read(&current_events->usage) != 1)
>> +                       pages += round_up(sizeof(*current_events), PAGE_SIZE) /
>> +                               PAGE_SIZE;
>> +       }
>> +       if (pages > LANDLOCK_EVENTS_MAX_PAGES) {
>> +               new_events = ERR_PTR(-E2BIG);
>> +               goto put_prog;
>> +       }
>> +
>> +       rule = kzalloc(sizeof(*rule), GFP_KERNEL);
>> +       if (!rule) {
>> +               new_events = ERR_PTR(-ENOMEM);
>> +               goto put_prog;
>> +       }
>> +       rule->prog = prog;
>> +
>> +       /* subtype.landlock_rule.event > 0 for loaded programs */
>> +       event_idx = get_index(rule->prog->subtype.landlock_rule.event);
>> +
>> +       if (!current_events) {
>> +               /* add a new landlock_events, if needed */
>> +               new_events = new_filled_landlock_events();
>> +               if (IS_ERR(new_events))
>> +                       goto put_rule;
>> +               add_landlock_rule(new_events, rule);
>> +       } else {
>> +               if (new_events->nodes[event_idx]->owner ==
>> +                               &new_events->nodes[event_idx]) {
>> +                       /* We are the owner, we can then update the node. */
>> +                       add_landlock_rule(new_events, rule);
>> +               } else if (atomic_read(&current_events->usage) == 1) {
>> +                       WARN_ON(new_events->nodes[event_idx]->owner);
>> +                       /*
>> +                        * We can become the new owner if no other task use it.
>> +                        * This avoid an unnecessary allocation.
>> +                        */
>> +                       new_events->nodes[event_idx]->owner =
>> +                               &new_events->nodes[event_idx];
>> +                       add_landlock_rule(new_events, rule);
> 
> I still don't get it all, but maybe here to make it simple you have to
> always allocate, since the WARN_ON() suggests it should be scheduled
> to be removed... and avoid to care about whether you are using/freeing
> old rules of the dead task... ?

I'm not sure to get your point but I will make the inheritance behavior
similar to seccomp filter at first, hence simpler.
Andy Lutomirski March 3, 2017, 12:55 a.m. UTC | #9
On Thu, Mar 2, 2017 at 4:48 PM, Mickaël Salaün <mic@digikod.net> wrote:
>
> On 02/03/2017 17:36, Andy Lutomirski wrote:
>> On Wed, Mar 1, 2017 at 3:28 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>>
>>>
>>> On 01/03/2017 23:20, Andy Lutomirski wrote:
>>>> On Wed, Mar 1, 2017 at 2:14 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>>>>
>>>>> On 28/02/2017 21:01, Andy Lutomirski wrote:
>>>>>> On Tue, Feb 21, 2017 at 5:26 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>>>> This design makes it possible for a process to add more constraints to
>>>>> its children on the fly. I think it is a good feature to have and a
>>>>> safer default inheritance mechanism, but it could be guarded by an
>>>>> option flag if we want both mechanism to be available. The same design
>>>>> could be used by seccomp filter too.
>>>>>
>>>>
>>>> Then let's do it right.
>>>>
>>>> Currently each task has an array of seccomp filter layers.  When a
>>>> task forks, the child inherits the layers.  All the layers are
>>>> presently immutable.  With Landlock, a layer can logically be a
>>>> syscall fitler layer or a Landlock layer.  This fits in to the
>>>> existing model just fine.
>>>>
>>>> If we want to have an interface to allow modification of an existing
>>>> layer, let's make it so that, when a layer is added, you have to
>>>> specify a flag to make the layer modifiable (by current, presumably,
>>>> although I can imagine other policies down the road).  Then have a
>>>> separate API that modifies a layer.
>>>>
>>>> IOW, I think your patch is bad for three reasons, all fixable:
>>>>
>>>> 1. The default is wrong.  A layer should be immutable to avoid an easy
>>>> attack in which you try to sandbox *yourself* and then you just modify
>>>> the layer to weaken it.
>>>
>>> This is not possible, there is only an operation for now:
>>> SECCOMP_ADD_LANDLOCK_RULE. You can only add more rules to the list (as
>>> for seccomp filter). There is no way to weaken a sandbox. The question
>>> is: how do we want to handle the rules *tree* (from the kernel point of
>>> view)?
>>>
>>
>> Fair enough.  But I still think that immutability (like regular
>> seccomp) should be the default.  For security, simplicity is
>> important.  I guess there could be two ways to relax immutability:
>> allowing making the layer stricter and allowing any change at all.
>>
>> As a default, though, programs should be able to expect that:
>>
>> seccomp(SECCOMP_ADD_WHATEVER, ...);
>> fork();
>>
>> [parent gets compromised]
>> [in parent]seccomp(anything whatsoever);
>>
>> will not affect the child,  If the parent wants to relax that, that's
>> fine, but I think it should be explicit.
>
> Good point. However the term "immutability" doesn't fit right because
> the process is still allowed to add more rules to itself (as for
> seccomp). The difference lays in the way a rule may be "appended" (by
> the current process) or "inserted" (by a parent process).
>
> I think three or four kind of operations (through the seccomp syscall)
> make sense:
> * append a rule (for the current process and its future children)

Sure, but this operation should *never* affect existing children,
existing seccomp layers, existing nodes, etc.  It should affect
current and future children only.  Or it could simply not exist for
Landlock and instead you'd have to add a layer (see below) and then
program that layer.

> * add a node (insert point), from which the inserted rules will be tied
> * insert a rule in the node, which will be inherited by futures children

I would advocate calling this a "seccomp layer" and making creation
and manipulation of them generic.

> * (maybe a "lock" command to make a layer immutable for the current
> process and its children)

Hmm, maybe.

>
> Doing so, a process is only allowed to insert a rule if a node was
> previously added. To forbid itself to insert new rules to one of its
> children, a process just need to not add a node before forking. Like
> this, there is no need for special rule flags nor default behavior,
> everything is explicit.

This is still slightly too complicated.  If you really want an
operation that adds a layer (please don't call it a node in the ABI)
and adds a rule to that layer in a single operation, it should be a
separate operation.  Please make everything explicit.

(I don't like exposing the word "node" to userspace because it means
nothing.  Having more than one layer of filter makes sense to me,
which is why I like "layer".  I'm sure that other good choices exist.)

>
> For this series, I will stick to the same behavior as seccomp filter:
> only append rules to the current process (and its future children).
>
>
>>>> 2. The API that adds a layer should be different from the API that
>>>> modifies a layer.
>>>
>>> Right, but it doesn't apply now because we can only add rules.
>>
>> That's not what the code appears to do, though.  Sometimes it makes a
>> new layer without modifying tasks that share the layer and sometimes
>> it modifies the layer.
>>
>> Both operations are probably okay, but they're not the same operation
>> and they shouldn't pretend to be.
>
> It should be OK with my previous proposal. The other details could be
> discussed in a separate future patch series.
>

NAK, or at least NAK pending better docs and justification.  The
operations of "add a layer and put a rule in it" and "add a rule to an
existing layer" are logically different and should not be the same
SECCOMP operation.  "Do what I mean" is a nice paradigm for a language
like Perl, but for security (and for kernel interfaces in general),
"do what I say and error out if I said nonsense" is much safer.
Mickaël Salaün March 3, 2017, 1:05 a.m. UTC | #10
On 03/03/2017 01:55, Andy Lutomirski wrote:
> On Thu, Mar 2, 2017 at 4:48 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>
>> On 02/03/2017 17:36, Andy Lutomirski wrote:
>>> On Wed, Mar 1, 2017 at 3:28 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>>>
>>>>
>>>> On 01/03/2017 23:20, Andy Lutomirski wrote:
>>>>> On Wed, Mar 1, 2017 at 2:14 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>>>>>
>>>>>> On 28/02/2017 21:01, Andy Lutomirski wrote:
>>>>>>> On Tue, Feb 21, 2017 at 5:26 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>>>>> This design makes it possible for a process to add more constraints to
>>>>>> its children on the fly. I think it is a good feature to have and a
>>>>>> safer default inheritance mechanism, but it could be guarded by an
>>>>>> option flag if we want both mechanism to be available. The same design
>>>>>> could be used by seccomp filter too.
>>>>>>
>>>>>
>>>>> Then let's do it right.
>>>>>
>>>>> Currently each task has an array of seccomp filter layers.  When a
>>>>> task forks, the child inherits the layers.  All the layers are
>>>>> presently immutable.  With Landlock, a layer can logically be a
>>>>> syscall fitler layer or a Landlock layer.  This fits in to the
>>>>> existing model just fine.
>>>>>
>>>>> If we want to have an interface to allow modification of an existing
>>>>> layer, let's make it so that, when a layer is added, you have to
>>>>> specify a flag to make the layer modifiable (by current, presumably,
>>>>> although I can imagine other policies down the road).  Then have a
>>>>> separate API that modifies a layer.
>>>>>
>>>>> IOW, I think your patch is bad for three reasons, all fixable:
>>>>>
>>>>> 1. The default is wrong.  A layer should be immutable to avoid an easy
>>>>> attack in which you try to sandbox *yourself* and then you just modify
>>>>> the layer to weaken it.
>>>>
>>>> This is not possible, there is only an operation for now:
>>>> SECCOMP_ADD_LANDLOCK_RULE. You can only add more rules to the list (as
>>>> for seccomp filter). There is no way to weaken a sandbox. The question
>>>> is: how do we want to handle the rules *tree* (from the kernel point of
>>>> view)?
>>>>
>>>
>>> Fair enough.  But I still think that immutability (like regular
>>> seccomp) should be the default.  For security, simplicity is
>>> important.  I guess there could be two ways to relax immutability:
>>> allowing making the layer stricter and allowing any change at all.
>>>
>>> As a default, though, programs should be able to expect that:
>>>
>>> seccomp(SECCOMP_ADD_WHATEVER, ...);
>>> fork();
>>>
>>> [parent gets compromised]
>>> [in parent]seccomp(anything whatsoever);
>>>
>>> will not affect the child,  If the parent wants to relax that, that's
>>> fine, but I think it should be explicit.
>>
>> Good point. However the term "immutability" doesn't fit right because
>> the process is still allowed to add more rules to itself (as for
>> seccomp). The difference lays in the way a rule may be "appended" (by
>> the current process) or "inserted" (by a parent process).
>>
>> I think three or four kind of operations (through the seccomp syscall)
>> make sense:
>> * append a rule (for the current process and its future children)
> 
> Sure, but this operation should *never* affect existing children,
> existing seccomp layers, existing nodes, etc.  It should affect
> current and future children only.  Or it could simply not exist for
> Landlock and instead you'd have to add a layer (see below) and then
> program that layer.
> 
>> * add a node (insert point), from which the inserted rules will be tied
>> * insert a rule in the node, which will be inherited by futures children
> 
> I would advocate calling this a "seccomp layer" and making creation
> and manipulation of them generic.
> 
>> * (maybe a "lock" command to make a layer immutable for the current
>> process and its children)
> 
> Hmm, maybe.
> 
>>
>> Doing so, a process is only allowed to insert a rule if a node was
>> previously added. To forbid itself to insert new rules to one of its
>> children, a process just need to not add a node before forking. Like
>> this, there is no need for special rule flags nor default behavior,
>> everything is explicit.
> 
> This is still slightly too complicated.  If you really want an
> operation that adds a layer (please don't call it a node in the ABI)
> and adds a rule to that layer in a single operation, it should be a
> separate operation.  Please make everything explicit.
> 
> (I don't like exposing the word "node" to userspace because it means
> nothing.  Having more than one layer of filter makes sense to me,
> which is why I like "layer".  I'm sure that other good choices exist.)

I keep that for a future discussion, I'm now convinced the simple
inheritance (seccomp-like) doesn't block future extension.

> 
>>
>> For this series, I will stick to the same behavior as seccomp filter:
>> only append rules to the current process (and its future children).
>>
>>
>>>>> 2. The API that adds a layer should be different from the API that
>>>>> modifies a layer.
>>>>
>>>> Right, but it doesn't apply now because we can only add rules.
>>>
>>> That's not what the code appears to do, though.  Sometimes it makes a
>>> new layer without modifying tasks that share the layer and sometimes
>>> it modifies the layer.
>>>
>>> Both operations are probably okay, but they're not the same operation
>>> and they shouldn't pretend to be.
>>
>> It should be OK with my previous proposal. The other details could be
>> discussed in a separate future patch series.
>>
> 
> NAK, or at least NAK pending better docs and justification.  The
> operations of "add a layer and put a rule in it" and "add a rule to an
> existing layer" are logically different and should not be the same
> SECCOMP operation.

We are agree.

> "Do what I mean" is a nice paradigm for a language
> like Perl, but for security (and for kernel interfaces in general),
> "do what I say and error out if I said nonsense" is much safer.
> 

Totally agree.
diff mbox

Patch

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index e25aee2cdfc0..9a38de3c0e72 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -10,6 +10,10 @@ 
 #include <linux/thread_info.h>
 #include <asm/seccomp.h>
 
+#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
+struct landlock_events;
+#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
+
 struct seccomp_filter;
 /**
  * struct seccomp - the state of a seccomp'ed process
@@ -18,6 +22,7 @@  struct seccomp_filter;
  *         system calls available to a process.
  * @filter: must always point to a valid seccomp-filter or NULL as it is
  *          accessed without locking during system call entry.
+ * @landlock_events: contains an array of Landlock rules.
  *
  *          @filter must only be accessed from the context of current as there
  *          is no read locking.
@@ -25,6 +30,9 @@  struct seccomp_filter;
 struct seccomp {
 	int mode;
 	struct seccomp_filter *filter;
+#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
+	struct landlock_events *landlock_events;
+#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
 };
 
 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 0f238a43ff1e..56dd692cddac 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -13,6 +13,7 @@ 
 /* Valid operations for seccomp syscall. */
 #define SECCOMP_SET_MODE_STRICT	0
 #define SECCOMP_SET_MODE_FILTER	1
+#define SECCOMP_ADD_LANDLOCK_RULE	2
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
 #define SECCOMP_FILTER_FLAG_TSYNC	1
diff --git a/kernel/fork.c b/kernel/fork.c
index a4f0d0e8aeb2..bd5c72dffe60 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -37,6 +37,7 @@ 
 #include <linux/security.h>
 #include <linux/hugetlb.h>
 #include <linux/seccomp.h>
+#include <linux/landlock.h>
 #include <linux/swap.h>
 #include <linux/syscalls.h>
 #include <linux/jiffies.h>
@@ -515,7 +516,10 @@  static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	 * the usage counts on the error path calling free_task.
 	 */
 	tsk->seccomp.filter = NULL;
-#endif
+#ifdef CONFIG_SECURITY_LANDLOCK
+	tsk->seccomp.landlock_events = NULL;
+#endif /* CONFIG_SECURITY_LANDLOCK */
+#endif /* CONFIG_SECCOMP */
 
 	setup_thread_stack(tsk, orig);
 	clear_user_return_notifier(tsk);
@@ -1388,7 +1392,13 @@  static void copy_seccomp(struct task_struct *p)
 
 	/* Ref-count the new filter user, and assign it. */
 	get_seccomp_filter(current);
-	p->seccomp = current->seccomp;
+	p->seccomp.mode = current->seccomp.mode;
+	p->seccomp.filter = current->seccomp.filter;
+#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
+	p->seccomp.landlock_events = current->seccomp.landlock_events;
+	if (p->seccomp.landlock_events)
+		atomic_inc(&p->seccomp.landlock_events->usage);
+#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
 
 	/*
 	 * Explicitly enable no_new_privs here in case it got set
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 06f2f3ee454c..ef412d95ff5d 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -32,6 +32,7 @@ 
 #include <linux/security.h>
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
+#include <linux/landlock.h>
 
 /**
  * struct seccomp_filter - container for seccomp BPF programs
@@ -492,6 +493,9 @@  static void put_seccomp_filter(struct seccomp_filter *filter)
 void put_seccomp(struct task_struct *tsk)
 {
 	put_seccomp_filter(tsk->seccomp.filter);
+#ifdef CONFIG_SECURITY_LANDLOCK
+	put_landlock_events(tsk->seccomp.landlock_events);
+#endif /* CONFIG_SECURITY_LANDLOCK */
 }
 
 /**
@@ -796,6 +800,10 @@  static long do_seccomp(unsigned int op, unsigned int flags,
 		return seccomp_set_mode_strict();
 	case SECCOMP_SET_MODE_FILTER:
 		return seccomp_set_mode_filter(flags, uargs);
+#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_SECURITY_LANDLOCK)
+	case SECCOMP_ADD_LANDLOCK_RULE:
+		return landlock_seccomp_append_prog(flags, uargs);
+#endif /* CONFIG_SECCOMP_FILTER && CONFIG_SECURITY_LANDLOCK */
 	default:
 		return -EINVAL;
 	}
diff --git a/security/landlock/Makefile b/security/landlock/Makefile
index 8dc8bde660bd..6c1b0d8bd810 100644
--- a/security/landlock/Makefile
+++ b/security/landlock/Makefile
@@ -2,4 +2,4 @@  ccflags-$(CONFIG_SECURITY_LANDLOCK) += -Werror=unused-function
 
 obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
 
-landlock-y := hooks.o
+landlock-y := hooks.o manager.o
diff --git a/security/landlock/hooks.c b/security/landlock/hooks.c
index 88ebe3f01758..704ea18377d2 100644
--- a/security/landlock/hooks.c
+++ b/security/landlock/hooks.c
@@ -290,7 +290,44 @@  static u64 mem_prot_to_access(unsigned long prot, bool private)
 
 static inline bool landlock_used(void)
 {
+#ifdef CONFIG_SECCOMP_FILTER
+	return !!(current->seccomp.landlock_events);
+#else
 	return false;
+#endif /* CONFIG_SECCOMP_FILTER */
+}
+
+/**
+ * landlock_run_prog - run Landlock program for a syscall
+ *
+ * @event_idx: event index in the rules array
+ * @ctx: non-NULL eBPF context
+ * @events: Landlock events pointer
+ */
+static int landlock_run_prog(u32 event_idx, const struct landlock_context *ctx,
+		struct landlock_events *events)
+{
+	struct landlock_node *node;
+
+	if (!events)
+		return 0;
+
+	for (node = events->nodes[event_idx]; node; node = node->prev) {
+		struct landlock_rule *rule;
+
+		for (rule = node->rule; rule; rule = rule->prev) {
+			u32 ret;
+
+			if (WARN_ON(!rule->prog))
+				continue;
+			rcu_read_lock();
+			ret = BPF_PROG_RUN(rule->prog, (void *)ctx);
+			rcu_read_unlock();
+			if (ret)
+				return -EPERM;
+		}
+	}
+	return 0;
 }
 
 static int landlock_decide(enum landlock_subtype_event event,
@@ -309,7 +346,10 @@  static int landlock_decide(enum landlock_subtype_event event,
 		.arg2 = ctx_values[1],
 	};
 
-	/* insert manager call here */
+#ifdef CONFIG_SECCOMP_FILTER
+	ret = landlock_run_prog(event_idx, &ctx,
+			current->seccomp.landlock_events);
+#endif /* CONFIG_SECCOMP_FILTER */
 
 	return ret;
 }
diff --git a/security/landlock/manager.c b/security/landlock/manager.c
new file mode 100644
index 000000000000..00bb2944c85e
--- /dev/null
+++ b/security/landlock/manager.c
@@ -0,0 +1,321 @@ 
+/*
+ * Landlock LSM - seccomp manager
+ *
+ * Copyright © 2017 Mickaël Salaün <mic@digikod.net>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#include <asm/page.h> /* PAGE_SIZE */
+#include <linux/atomic.h> /* atomic_*(), smp_store_release() */
+#include <linux/bpf.h> /* bpf_prog_put() */
+#include <linux/filter.h> /* struct bpf_prog */
+#include <linux/kernel.h> /* round_up() */
+#include <linux/landlock.h>
+#include <linux/sched.h> /* current_cred(), task_no_new_privs() */
+#include <linux/security.h> /* security_capable_noaudit() */
+#include <linux/slab.h> /* alloc(), kfree() */
+#include <linux/types.h> /* atomic_t */
+#include <linux/uaccess.h> /* copy_from_user() */
+
+#include "common.h"
+
+static void put_landlock_rule(struct landlock_rule *rule)
+{
+	struct landlock_rule *orig = rule;
+
+	/* clean up single-reference branches iteratively */
+	while (orig && atomic_dec_and_test(&orig->usage)) {
+		struct landlock_rule *freeme = orig;
+
+		bpf_prog_put(orig->prog);
+		orig = orig->prev;
+		kfree(freeme);
+	}
+}
+
+static void put_landlock_node(struct landlock_node *node)
+{
+	struct landlock_node *orig = node;
+
+	/* clean up single-reference branches iteratively */
+	while (orig && atomic_dec_and_test(&orig->usage)) {
+		struct landlock_node *freeme = orig;
+
+		put_landlock_rule(orig->rule);
+		orig = orig->prev;
+		kfree(freeme);
+	}
+}
+
+void put_landlock_events(struct landlock_events *events)
+{
+	if (events && atomic_dec_and_test(&events->usage)) {
+		size_t i;
+
+		/* XXX: Do we need to use lockless_dereference() here? */
+		for (i = 0; i < ARRAY_SIZE(events->nodes); i++) {
+			if (!events->nodes[i])
+				continue;
+			/* Are we the owner of this node? */
+			if (events->nodes[i]->owner == &events->nodes[i])
+				events->nodes[i]->owner = NULL;
+			put_landlock_node(events->nodes[i]);
+		}
+		kfree(events);
+	}
+}
+
+static struct landlock_events *new_raw_landlock_events(void)
+{
+	struct landlock_events *ret;
+
+	/* array filled with NULL values */
+	ret = kzalloc(sizeof(*ret), GFP_KERNEL);
+	if (!ret)
+		return ERR_PTR(-ENOMEM);
+	atomic_set(&ret->usage, 1);
+	return ret;
+}
+
+static struct landlock_events *new_filled_landlock_events(void)
+{
+	size_t i;
+	struct landlock_events *ret;
+
+	ret = new_raw_landlock_events();
+	if (IS_ERR(ret))
+		return ret;
+	/*
+	 * We need to initially allocate every nodes to be able to update the
+	 * rules they are pointing to, across every (future) children of the
+	 * current task.
+	 */
+	for (i = 0; i < ARRAY_SIZE(ret->nodes); i++) {
+		struct landlock_node *node;
+
+		node = kzalloc(sizeof(*node), GFP_KERNEL);
+		if (!node)
+			goto put_events;
+		atomic_set(&node->usage, 1);
+		/* we are the owner of this node */
+		node->owner = &ret->nodes[i];
+		ret->nodes[i] = node;
+	}
+	return ret;
+
+put_events:
+	put_landlock_events(ret);
+	return ERR_PTR(-ENOMEM);
+}
+
+static void add_landlock_rule(struct landlock_events *events,
+		struct landlock_rule *rule)
+{
+	/* subtype.landlock_rule.event > 0 for loaded programs */
+	u32 event_idx = get_index(rule->prog->subtype.landlock_rule.event);
+
+	rule->prev = events->nodes[event_idx]->rule;
+	WARN_ON(atomic_read(&rule->usage));
+	atomic_set(&rule->usage, 1);
+	/* do not increment the previous rule usage */
+	smp_store_release(&events->nodes[event_idx]->rule, rule);
+}
+
+/* Limit Landlock events to 256KB. */
+#define LANDLOCK_EVENTS_MAX_PAGES (1 << 6)
+
+/**
+ * landlock_append_prog - attach a Landlock rule to @current_events
+ *
+ * @current_events: landlock_events pointer, must be locked (if needed) to
+ *                  prevent a concurrent put/free. This pointer must not be
+ *                  freed after the call.
+ * @prog: non-NULL Landlock rule to append to @current_events. @prog will be
+ *        owned by landlock_append_prog() and freed if an error happened.
+ *
+ * Return @current_events or a new pointer when OK. Return a pointer error
+ * otherwise.
+ */
+static struct landlock_events *landlock_append_prog(
+		struct landlock_events *current_events, struct bpf_prog *prog)
+{
+	struct landlock_events *new_events = current_events;
+	unsigned long pages;
+	struct landlock_rule *rule;
+	u32 event_idx;
+
+	if (prog->type != BPF_PROG_TYPE_LANDLOCK) {
+		new_events = ERR_PTR(-EINVAL);
+		goto put_prog;
+	}
+
+	/* validate memory size allocation */
+	pages = prog->pages;
+	if (current_events) {
+		size_t i;
+
+		for (i = 0; i < ARRAY_SIZE(current_events->nodes); i++) {
+			struct landlock_node *walker_n;
+
+			for (walker_n = current_events->nodes[i];
+					walker_n;
+					walker_n = walker_n->prev) {
+				struct landlock_rule *walker_r;
+
+				for (walker_r = walker_n->rule;
+						walker_r;
+						walker_r = walker_r->prev)
+					pages += walker_r->prog->pages;
+			}
+		}
+		/* count a struct landlock_events if we need to allocate one */
+		if (atomic_read(&current_events->usage) != 1)
+			pages += round_up(sizeof(*current_events), PAGE_SIZE) /
+				PAGE_SIZE;
+	}
+	if (pages > LANDLOCK_EVENTS_MAX_PAGES) {
+		new_events = ERR_PTR(-E2BIG);
+		goto put_prog;
+	}
+
+	rule = kzalloc(sizeof(*rule), GFP_KERNEL);
+	if (!rule) {
+		new_events = ERR_PTR(-ENOMEM);
+		goto put_prog;
+	}
+	rule->prog = prog;
+
+	/* subtype.landlock_rule.event > 0 for loaded programs */
+	event_idx = get_index(rule->prog->subtype.landlock_rule.event);
+
+	if (!current_events) {
+		/* add a new landlock_events, if needed */
+		new_events = new_filled_landlock_events();
+		if (IS_ERR(new_events))
+			goto put_rule;
+		add_landlock_rule(new_events, rule);
+	} else {
+		if (new_events->nodes[event_idx]->owner ==
+				&new_events->nodes[event_idx]) {
+			/* We are the owner, we can then update the node. */
+			add_landlock_rule(new_events, rule);
+		} else if (atomic_read(&current_events->usage) == 1) {
+			WARN_ON(new_events->nodes[event_idx]->owner);
+			/*
+			 * We can become the new owner if no other task use it.
+			 * This avoid an unnecessary allocation.
+			 */
+			new_events->nodes[event_idx]->owner =
+				&new_events->nodes[event_idx];
+			add_landlock_rule(new_events, rule);
+		} else {
+			/*
+			 * We are not the owner, we need to fork current_events
+			 * and then add a new node.
+			 */
+			struct landlock_node *node;
+			size_t i;
+
+			node = kmalloc(sizeof(*node), GFP_KERNEL);
+			if (!node) {
+				new_events = ERR_PTR(-ENOMEM);
+				goto put_rule;
+			}
+			atomic_set(&node->usage, 1);
+			/* set the previous node after the new_events
+			 * allocation */
+			node->prev = NULL;
+			/* do not increment the previous node usage */
+			node->owner = &new_events->nodes[event_idx];
+			/* rule->prev is already NULL */
+			atomic_set(&rule->usage, 1);
+			node->rule = rule;
+
+			new_events = new_raw_landlock_events();
+			if (IS_ERR(new_events)) {
+				/* put the rule as well */
+				put_landlock_node(node);
+				return ERR_PTR(-ENOMEM);
+			}
+			for (i = 0; i < ARRAY_SIZE(new_events->nodes); i++) {
+				new_events->nodes[i] =
+					lockless_dereference(
+							current_events->nodes[i]);
+				if (i == event_idx)
+					node->prev = new_events->nodes[i];
+				if (!WARN_ON(!new_events->nodes[i]))
+					atomic_inc(&new_events->nodes[i]->usage);
+			}
+			new_events->nodes[event_idx] = node;
+
+			/*
+			 * @current_events will not be freed here because it's usage
+			 * field is > 1. It is only prevented to be freed by another
+			 * subject thanks to the caller of landlock_append_prog() which
+			 * should be locked if needed.
+			 */
+			put_landlock_events(current_events);
+		}
+	}
+	return new_events;
+
+put_prog:
+	bpf_prog_put(prog);
+	return new_events;
+
+put_rule:
+	put_landlock_rule(rule);
+	return new_events;
+}
+
+/**
+ * landlock_seccomp_append_prog - attach a Landlock rule to the current process
+ *
+ * current->seccomp.landlock_events is lazily allocated. When a process fork,
+ * only a pointer is copied. When a new event is added by a process, if there
+ * is other references to this process' landlock_events, then a new allocation
+ * is made to contains an array pointing to Landlock rule lists. This design
+ * has low-performance impact and is memory efficient while keeping the
+ * property of append-only rules.
+ *
+ * @flags: not used for now, but could be used for TSYNC
+ * @user_bpf_fd: file descriptor pointing to a loaded Landlock rule
+ */
+#ifdef CONFIG_SECCOMP_FILTER
+int landlock_seccomp_append_prog(unsigned int flags, const char __user *user_bpf_fd)
+{
+	struct landlock_events *new_events;
+	struct bpf_prog *prog;
+	int bpf_fd;
+
+	/* force no_new_privs to limit privilege escalation */
+	if (!task_no_new_privs(current))
+		return -EPERM;
+	/* will be removed in the future to allow unprivileged tasks */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	if (!user_bpf_fd)
+		return -EFAULT;
+	if (flags)
+		return -EINVAL;
+	if (copy_from_user(&bpf_fd, user_bpf_fd, sizeof(bpf_fd)))
+		return -EFAULT;
+	prog = bpf_prog_get(bpf_fd);
+	if (IS_ERR(prog))
+		return PTR_ERR(prog);
+
+	/*
+	 * We don't need to lock anything for the current process hierarchy,
+	 * everything is guarded by the atomic counters.
+	 */
+	new_events = landlock_append_prog(current->seccomp.landlock_events, prog);
+	/* @prog is managed/freed by landlock_append_prog() */
+	if (IS_ERR(new_events))
+		return PTR_ERR(new_events);
+	current->seccomp.landlock_events = new_events;
+	return 0;
+}
+#endif /* CONFIG_SECCOMP_FILTER */