mbox series

[RFC,v14,00/10] Landlock LSM

Message ID 20200224160215.4136-1-mic@digikod.net (mailing list archive)
Headers show
Series Landlock LSM | expand

Message

Mickaël Salaün Feb. 24, 2020, 4:02 p.m. UTC
Hi,

This new version of Landlock is a major revamp of the previous series
[1], hence the RFC tag.  The three main changes are the replacement of
eBPF with a dedicated safe management of access rules, the replacement
of the use of seccomp(2) with a dedicated syscall, and the management of
filesystem access-control (back from the v10).

As discussed in [2], eBPF may be too powerful and dangerous to be put in
the hand of unprivileged and potentially malicious processes, especially
because of side-channel attacks against access-controls or other parts
of the kernel.

Thanks to this new implementation (1540 SLOC), designed from the ground
to be used by unprivileged processes, this series enables a process to
sandbox itself without requiring CAP_SYS_ADMIN, but only the
no_new_privs constraint (like seccomp).  Not relying on eBPF also
enables to improve performances, especially for stacked security
policies thanks to mergeable rulesets.

The compiled documentation is available here:
https://landlock.io/linux-doc/landlock-v14/security/landlock/index.html

This series can be applied on top of v5.6-rc3.  This can be tested with
CONFIG_SECURITY_LANDLOCK and CONFIG_SAMPLE_LANDLOCK.  This patch series
can be found in a Git repository here:
https://github.com/landlock-lsm/linux/commits/landlock-v14
I would really appreciate constructive comments on the design and the code.


# Landlock LSM

The goal of Landlock is to enable to restrict ambient rights (e.g.
global filesystem access) for a set of processes.  Because Landlock is a
stackable LSM [3], it makes possible to create safe security sandboxes
as new security layers in addition to the existing system-wide
access-controls. This kind of sandbox is expected to help mitigate the
security impact of bugs or unexpected/malicious behaviors in user-space
applications. Landlock empower any process, including unprivileged ones,
to securely restrict themselves.

Landlock is inspired by seccomp-bpf but instead of filtering syscalls
and their raw arguments, a Landlock rule can restrict the use of kernel
objects like file hierarchies, according to the kernel semantic.
Landlock also takes inspiration from other OS sandbox mechanisms: XNU
Sandbox, FreeBSD Capsicum or OpenBSD Pledge/Unveil.


# Current limitations

## Path walk

Landlock need to use dentries to identify a file hierarchy, which is
needed for composable and unprivileged access-controls. This means that
path resolution/walking (handled with inode_permission()) is not
supported, yet. This could be filled with a future extension first of
the LSM framework. The Landlock userspace ABI can handle such change
with new option (e.g. to the struct landlock_ruleset).

## UnionFS

An UnionFS super-block use a set of upper and lower directories. An
access request to a file in one of these hierarchy trigger a call to
ovl_path_real() which generate another access request according to the
matching hierarchy. Because such super-block is not aware of its current
mount point, OverlayFS can't create a dedicated mnt_parent for each of
the upper and lower directories mount clones. It is then not currently
possible to track the source of such indirect access-request, and then
not possible to identify a unified OverlayFS hierarchy.

## Syscall

Because it is only tested on x86_64, the syscall is only wired up for
this architecture.  The whole x86 family (and probably all the others)
will be supported in the next patch series.


## Memory limits

There is currently no limit on the memory usage.  Any idea to leverage
an existing mechanism (e.g. rlimit)?


# Changes since v13

* Revamp of the LSM: remove the need for eBPF and seccomp(2).
* Implement a full filesystem access-control.
* Take care of the backward compatibility issues, especially for
  this security features.

Previous version:
https://lore.kernel.org/lkml/20191104172146.30797-1-mic@digikod.net/

[1] https://lore.kernel.org/lkml/20191104172146.30797-1-mic@digikod.net/
[2] https://lore.kernel.org/lkml/a6b61f33-82dc-0c1c-7a6c-1926343ef63e@digikod.net/
[3] https://lore.kernel.org/lkml/50db058a-7dde-441b-a7f9-f6837fe8b69f@schaufler-ca.com/

Regards,

Mickaël Salaün (10):
  landlock: Add object and rule management
  landlock: Add ruleset and domain management
  landlock: Set up the security framework and manage credentials
  landlock: Add ptrace restrictions
  fs,landlock: Support filesystem access-control
  landlock: Add syscall implementation
  arch: Wire up landlock() syscall
  selftests/landlock: Add initial tests
  samples/landlock: Add a sandbox manager example
  landlock: Add user and kernel documentation

 Documentation/security/index.rst              |   1 +
 Documentation/security/landlock/index.rst     |  18 +
 Documentation/security/landlock/kernel.rst    |  44 ++
 Documentation/security/landlock/user.rst      | 233 +++++++
 MAINTAINERS                                   |  12 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 fs/super.c                                    |   2 +
 include/linux/landlock.h                      |  22 +
 include/linux/syscalls.h                      |   3 +
 include/uapi/asm-generic/unistd.h             |   4 +-
 include/uapi/linux/landlock.h                 | 315 +++++++++
 samples/Kconfig                               |   7 +
 samples/Makefile                              |   1 +
 samples/landlock/.gitignore                   |   1 +
 samples/landlock/Makefile                     |  15 +
 samples/landlock/sandboxer.c                  | 226 +++++++
 security/Kconfig                              |  11 +-
 security/Makefile                             |   2 +
 security/landlock/Kconfig                     |  16 +
 security/landlock/Makefile                    |   4 +
 security/landlock/cred.c                      |  47 ++
 security/landlock/cred.h                      |  55 ++
 security/landlock/fs.c                        | 591 +++++++++++++++++
 security/landlock/fs.h                        |  42 ++
 security/landlock/object.c                    | 341 ++++++++++
 security/landlock/object.h                    | 134 ++++
 security/landlock/ptrace.c                    | 118 ++++
 security/landlock/ptrace.h                    |  14 +
 security/landlock/ruleset.c                   | 463 +++++++++++++
 security/landlock/ruleset.h                   | 106 +++
 security/landlock/setup.c                     |  38 ++
 security/landlock/setup.h                     |  20 +
 security/landlock/syscall.c                   | 470 +++++++++++++
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/landlock/.gitignore   |   3 +
 tools/testing/selftests/landlock/Makefile     |  13 +
 tools/testing/selftests/landlock/config       |   4 +
 tools/testing/selftests/landlock/test.h       |  40 ++
 tools/testing/selftests/landlock/test_base.c  |  80 +++
 tools/testing/selftests/landlock/test_fs.c    | 624 ++++++++++++++++++
 .../testing/selftests/landlock/test_ptrace.c  | 293 ++++++++
 41 files changed, 4429 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/security/landlock/index.rst
 create mode 100644 Documentation/security/landlock/kernel.rst
 create mode 100644 Documentation/security/landlock/user.rst
 create mode 100644 include/linux/landlock.h
 create mode 100644 include/uapi/linux/landlock.h
 create mode 100644 samples/landlock/.gitignore
 create mode 100644 samples/landlock/Makefile
 create mode 100644 samples/landlock/sandboxer.c
 create mode 100644 security/landlock/Kconfig
 create mode 100644 security/landlock/Makefile
 create mode 100644 security/landlock/cred.c
 create mode 100644 security/landlock/cred.h
 create mode 100644 security/landlock/fs.c
 create mode 100644 security/landlock/fs.h
 create mode 100644 security/landlock/object.c
 create mode 100644 security/landlock/object.h
 create mode 100644 security/landlock/ptrace.c
 create mode 100644 security/landlock/ptrace.h
 create mode 100644 security/landlock/ruleset.c
 create mode 100644 security/landlock/ruleset.h
 create mode 100644 security/landlock/setup.c
 create mode 100644 security/landlock/setup.h
 create mode 100644 security/landlock/syscall.c
 create mode 100644 tools/testing/selftests/landlock/.gitignore
 create mode 100644 tools/testing/selftests/landlock/Makefile
 create mode 100644 tools/testing/selftests/landlock/config
 create mode 100644 tools/testing/selftests/landlock/test.h
 create mode 100644 tools/testing/selftests/landlock/test_base.c
 create mode 100644 tools/testing/selftests/landlock/test_fs.c
 create mode 100644 tools/testing/selftests/landlock/test_ptrace.c

Comments

Jay Freyensee Feb. 25, 2020, 6:49 p.m. UTC | #1
On 2/24/20 8:02 AM, Mickaël Salaün wrote:

> ## Syscall
>
> Because it is only tested on x86_64, the syscall is only wired up for
> this architecture.  The whole x86 family (and probably all the others)
> will be supported in the next patch series.
General question for u.  What is it meant "whole x86 family will be 
supported".  32-bit x86 will be supported?

Thanks,
Jay
Mickaël Salaün Feb. 26, 2020, 3:34 p.m. UTC | #2
On 25/02/2020 19:49, J Freyensee wrote:
> 
> 
> On 2/24/20 8:02 AM, Mickaël Salaün wrote:
> 
>> ## Syscall
>>
>> Because it is only tested on x86_64, the syscall is only wired up for
>> this architecture.  The whole x86 family (and probably all the others)
>> will be supported in the next patch series.
> General question for u.  What is it meant "whole x86 family will be
> supported".  32-bit x86 will be supported?

Yes, I was referring to x86_32, x86_64 and x32, but all architectures
should be supported.
Hillf Danton Feb. 27, 2020, 4:20 a.m. UTC | #3
On Mon, 24 Feb 2020 17:02:06 +0100 Mickaël Salaün 
> A Landlock object enables to identify a kernel object (e.g. an inode).
> A Landlock rule is a set of access rights allowed on an object.  Rules
> are grouped in rulesets that may be tied to a set of processes (i.e.
> subjects) to enforce a scoped access-control (i.e. a domain).
> 
> Because Landlock's goal is to empower any process (especially
> unprivileged ones) to sandbox themselves, we can't rely on a system-wide
> object identification such as file extended attributes.  Indeed, we need
> innocuous, composable and modular access-controls.
> 
> The main challenge with this constraints is to identify kernel objects
> while this identification is useful (i.e. when a security policy makes
> use of this object).  But this identification data should be freed once
> no policy is using it.  This ephemeral tagging should not and may not be
> written in the filesystem.  We then need to manage the lifetime of a
> rule according to the lifetime of its object.  To avoid a global lock,
> this implementation make use of RCU and counters to safely reference
> objects.
> 
> A following commit uses this generic object management for inodes.
> 
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: James Morris <jmorris@namei.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Serge E. Hallyn <serge@hallyn.com>
> ---
> 
> Changes since v13:
> * New dedicated implementation, removing the need for eBPF.
> 
> Previous version:
> https://lore.kernel.org/lkml/20190721213116.23476-6-mic@digikod.net/
> ---
>  MAINTAINERS                |  10 ++
>  security/Kconfig           |   1 +
>  security/Makefile          |   2 +
>  security/landlock/Kconfig  |  15 ++
>  security/landlock/Makefile |   3 +
>  security/landlock/object.c | 339 +++++++++++++++++++++++++++++++++++++
>  security/landlock/object.h | 134 +++++++++++++++
>  7 files changed, 504 insertions(+)
>  create mode 100644 security/landlock/Kconfig
>  create mode 100644 security/landlock/Makefile
>  create mode 100644 security/landlock/object.c
>  create mode 100644 security/landlock/object.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index fcd79fc38928..206f85768cd9 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -9360,6 +9360,16 @@ F:	net/core/skmsg.c
>  F:	net/core/sock_map.c
>  F:	net/ipv4/tcp_bpf.c
>  
> +LANDLOCK SECURITY MODULE
> +M:	Mickaël Salaün <mic@digikod.net>
> +L:	linux-security-module@vger.kernel.org
> +W:	https://landlock.io
> +T:	git https://github.com/landlock-lsm/linux.git
> +S:	Supported
> +F:	security/landlock/
> +K:	landlock
> +K:	LANDLOCK
> +
>  LANTIQ / INTEL Ethernet drivers
>  M:	Hauke Mehrtens <hauke@hauke-m.de>
>  L:	netdev@vger.kernel.org
> diff --git a/security/Kconfig b/security/Kconfig
> index 2a1a2d396228..9d9981394fb0 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -238,6 +238,7 @@ source "security/loadpin/Kconfig"
>  source "security/yama/Kconfig"
>  source "security/safesetid/Kconfig"
>  source "security/lockdown/Kconfig"
> +source "security/landlock/Kconfig"
>  
>  source "security/integrity/Kconfig"
>  
> diff --git a/security/Makefile b/security/Makefile
> index 746438499029..2472ef96d40a 100644
> --- a/security/Makefile
> +++ b/security/Makefile
> @@ -12,6 +12,7 @@ subdir-$(CONFIG_SECURITY_YAMA)		+= yama
>  subdir-$(CONFIG_SECURITY_LOADPIN)	+= loadpin
>  subdir-$(CONFIG_SECURITY_SAFESETID)    += safesetid
>  subdir-$(CONFIG_SECURITY_LOCKDOWN_LSM)	+= lockdown
> +subdir-$(CONFIG_SECURITY_LANDLOCK)		+= landlock
>  
>  # always enable default capabilities
>  obj-y					+= commoncap.o
> @@ -29,6 +30,7 @@ obj-$(CONFIG_SECURITY_YAMA)		+= yama/
>  obj-$(CONFIG_SECURITY_LOADPIN)		+= loadpin/
>  obj-$(CONFIG_SECURITY_SAFESETID)       += safesetid/
>  obj-$(CONFIG_SECURITY_LOCKDOWN_LSM)	+= lockdown/
> +obj-$(CONFIG_SECURITY_LANDLOCK)	+= landlock/
>  obj-$(CONFIG_CGROUP_DEVICE)		+= device_cgroup.o
>  
>  # Object integrity file lists
> diff --git a/security/landlock/Kconfig b/security/landlock/Kconfig
> new file mode 100644
> index 000000000000..4a321d5b3f67
> --- /dev/null
> +++ b/security/landlock/Kconfig
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +
> +config SECURITY_LANDLOCK
> +	bool "Landlock support"
> +	depends on SECURITY
> +	default n
> +	help
> +	  This selects Landlock, a safe sandboxing mechanism.  It enables to
> +	  restrict processes on the fly (i.e. enforce an access control policy),
> +	  which can complement seccomp-bpf.  The security policy is a set of access
> +	  rights tied to an object, which could be a file, a socket or a process.
> +
> +	  See Documentation/security/landlock/ for further information.
> +
> +	  If you are unsure how to answer this question, answer N.
> diff --git a/security/landlock/Makefile b/security/landlock/Makefile
> new file mode 100644
> index 000000000000..cb6deefbf4c0
> --- /dev/null
> +++ b/security/landlock/Makefile
> @@ -0,0 +1,3 @@
> +obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
> +
> +landlock-y := object.o
> diff --git a/security/landlock/object.c b/security/landlock/object.c
> new file mode 100644
> index 000000000000..38fbbb108120
> --- /dev/null
> +++ b/security/landlock/object.c
> @@ -0,0 +1,339 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Landlock LSM - Object and rule management
> + *
> + * Copyright © 2016-2020 Mickaël Salaün <mic@digikod.net>
> + * Copyright © 2018-2020 ANSSI
> + *
> + * Principles and constraints of the object and rule management:
> + * - Do not leak memory.
> + * - Try as much as possible to free a memory allocation as soon as it is
> + *   unused.
> + * - Do not use global lock.
> + * - Do not charge processes other than the one requesting a Landlock
> + *   operation.
> + */
> +
> +#include <linux/bug.h>
> +#include <linux/compiler.h>
> +#include <linux/compiler_types.h>
> +#include <linux/err.h>
> +#include <linux/errno.h>
> +#include <linux/fs.h>
> +#include <linux/kernel.h>
> +#include <linux/list.h>
> +#include <linux/rbtree.h>
> +#include <linux/rcupdate.h>
> +#include <linux/refcount.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/workqueue.h>
> +
> +#include "object.h"
> +
> +struct landlock_object *landlock_create_object(
> +		const enum landlock_object_type type, void *underlying_object)
> +{
> +	struct landlock_object *object;
> +
> +	if (WARN_ON_ONCE(!underlying_object))
> +		return NULL;
> +	object = kzalloc(sizeof(*object), GFP_KERNEL);
> +	if (!object)
> +		return NULL;
> +	refcount_set(&object->usage, 1);
> +	refcount_set(&object->cleaners, 1);
> +	spin_lock_init(&object->lock);
> +	INIT_LIST_HEAD(&object->rules);
> +	object->type = type;
> +	WRITE_ONCE(object->underlying_object, underlying_object);
> +	return object;
> +}
> +
> +struct landlock_object *landlock_get_object(struct landlock_object *object)
> +	__acquires(object->usage)
> +{
> +	__acquire(object->usage);
> +	/*
> +	 * If @object->usage equal 0, then it will be ignored by writers, and
> +	 * underlying_object->object may be replaced, but this is not an issue
> +	 * for release_object().
> +	 */
> +	if (object && refcount_inc_not_zero(&object->usage)) {
> +		/*
> +		 * It should not be possible to get a reference to an object if
> +		 * its underlying object is being terminated (e.g. with
> +		 * landlock_release_object()), because an object is only
> +		 * modifiable through such underlying object.  This is not the
> +		 * case with landlock_get_object_cleaner().
> +		 */
> +		WARN_ON_ONCE(!READ_ONCE(object->underlying_object));
> +		return object;
> +	}
> +	return NULL;
> +}
> +
> +static struct landlock_object *get_object_cleaner(
> +		struct landlock_object *object)
> +	__acquires(object->cleaners)
> +{
> +	__acquire(object->cleaners);
> +	if (object && refcount_inc_not_zero(&object->cleaners))
> +		return object;
> +	return NULL;
> +}
> +
> +/*
> + * There is two cases when an object should be free and the reference to the
> + * underlying object should be put:
> + * - when the last rule tied to this object is removed, which is handled by
> + *   landlock_put_rule() and then release_object();
> + * - when the object is being terminated (e.g. no more reference to an inode),
> + *   which is handled by landlock_put_object().
> + */
> +static void put_object_free(struct landlock_object *object)
> +	__releases(object->cleaners)
> +{
> +	__release(object->cleaners);
> +	if (!refcount_dec_and_test(&object->cleaners))
> +		return;
> +	WARN_ON_ONCE(refcount_read(&object->usage));
> +	/*
> +	 * Ensures a safe use of @object in the RCU block from
> +	 * landlock_put_rule().
> +	 */
> +	kfree_rcu(object, rcu_free);
> +}
> +
> +/*
> + * Destroys a newly created and useless object.
> + */
> +void landlock_drop_object(struct landlock_object *object)
> +{
> +	if (WARN_ON_ONCE(!refcount_dec_and_test(&object->usage)))
> +		return;
> +	__acquire(object->cleaners);
> +	put_object_free(object);
> +}
> +
> +/*
> + * Puts the underlying object (e.g. inode) if it is the first request to
> + * release @object, without calling landlock_put_object().
> + *
> + * Return true if this call effectively marks @object as released, false
> + * otherwise.
> + */
> +static bool release_object(struct landlock_object *object)
> +	__releases(&object->lock)
> +{
> +	void *underlying_object;
> +
> +	lockdep_assert_held(&object->lock);
> +
> +	underlying_object = xchg(&object->underlying_object, NULL);

A one-line comment looks needed for xchg.

> +	spin_unlock(&object->lock);
> +	might_sleep();

Have trouble working out what might_sleep is put for.

> +	if (!underlying_object)
> +		return false;
> +
> +	switch (object->type) {
> +	case LANDLOCK_OBJECT_INODE:
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +	}
> +	return true;
> +}
> +
> +static void put_object_cleaner(struct landlock_object *object)
> +	__releases(object->cleaners)
> +{
> +	/* Let's try an early lockless check. */
> +	if (list_empty(&object->rules) &&
> +			READ_ONCE(object->underlying_object)) {
> +		/*
> +		 * Puts @object if there is no rule tied to it and the
> +		 * remaining user is the underlying object.  This check is
> +		 * atomic because @object->rules and @object->underlying_object
> +		 * are protected by @object->lock.
> +		 */
> +		spin_lock(&object->lock);
> +		if (list_empty(&object->rules) &&
> +				READ_ONCE(object->underlying_object) &&
> +				refcount_dec_if_one(&object->usage)) {
> +			/*
> +			 * Releases @object, in place of
> +			 * landlock_release_object().
> +			 *
> +			 * @object is already empty, implying that all its
> +			 * previous rules are already disabled.
> +			 *
> +			 * Unbalance the @object->cleaners counter to reflect
> +			 * the underlying object release.
> +			 */
> +			if (!WARN_ON_ONCE(!release_object(object))) {

Two ! hurt more than help.
> +				__acquire(object->cleaners);
> +				put_object_free(object);

Why put object more than once?

> +			}
> +		} else {
> +			spin_unlock(&object->lock);
> +		}
> +	}
> +	put_object_free(object);
> +}
> +
Mickaël Salaün Feb. 27, 2020, 5:01 p.m. UTC | #4
On 27/02/2020 05:20, Hillf Danton wrote:
> 
> On Mon, 24 Feb 2020 17:02:06 +0100 Mickaël Salaün 
>> A Landlock object enables to identify a kernel object (e.g. an inode).
>> A Landlock rule is a set of access rights allowed on an object.  Rules
>> are grouped in rulesets that may be tied to a set of processes (i.e.
>> subjects) to enforce a scoped access-control (i.e. a domain).
>>
>> Because Landlock's goal is to empower any process (especially
>> unprivileged ones) to sandbox themselves, we can't rely on a system-wide
>> object identification such as file extended attributes.  Indeed, we need
>> innocuous, composable and modular access-controls.
>>
>> The main challenge with this constraints is to identify kernel objects
>> while this identification is useful (i.e. when a security policy makes
>> use of this object).  But this identification data should be freed once
>> no policy is using it.  This ephemeral tagging should not and may not be
>> written in the filesystem.  We then need to manage the lifetime of a
>> rule according to the lifetime of its object.  To avoid a global lock,
>> this implementation make use of RCU and counters to safely reference
>> objects.
>>
>> A following commit uses this generic object management for inodes.
>>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: James Morris <jmorris@namei.org>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Serge E. Hallyn <serge@hallyn.com>
>> ---
>>
>> Changes since v13:
>> * New dedicated implementation, removing the need for eBPF.
>>
>> Previous version:
>> https://lore.kernel.org/lkml/20190721213116.23476-6-mic@digikod.net/
>> ---
>>  MAINTAINERS                |  10 ++
>>  security/Kconfig           |   1 +
>>  security/Makefile          |   2 +
>>  security/landlock/Kconfig  |  15 ++
>>  security/landlock/Makefile |   3 +
>>  security/landlock/object.c | 339 +++++++++++++++++++++++++++++++++++++
>>  security/landlock/object.h | 134 +++++++++++++++
>>  7 files changed, 504 insertions(+)
>>  create mode 100644 security/landlock/Kconfig
>>  create mode 100644 security/landlock/Makefile
>>  create mode 100644 security/landlock/object.c
>>  create mode 100644 security/landlock/object.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index fcd79fc38928..206f85768cd9 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -9360,6 +9360,16 @@ F:	net/core/skmsg.c
>>  F:	net/core/sock_map.c
>>  F:	net/ipv4/tcp_bpf.c
>>  
>> +LANDLOCK SECURITY MODULE
>> +M:	Mickaël Salaün <mic@digikod.net>
>> +L:	linux-security-module@vger.kernel.org
>> +W:	https://landlock.io
>> +T:	git https://github.com/landlock-lsm/linux.git
>> +S:	Supported
>> +F:	security/landlock/
>> +K:	landlock
>> +K:	LANDLOCK
>> +
>>  LANTIQ / INTEL Ethernet drivers
>>  M:	Hauke Mehrtens <hauke@hauke-m.de>
>>  L:	netdev@vger.kernel.org
>> diff --git a/security/Kconfig b/security/Kconfig
>> index 2a1a2d396228..9d9981394fb0 100644
>> --- a/security/Kconfig
>> +++ b/security/Kconfig
>> @@ -238,6 +238,7 @@ source "security/loadpin/Kconfig"
>>  source "security/yama/Kconfig"
>>  source "security/safesetid/Kconfig"
>>  source "security/lockdown/Kconfig"
>> +source "security/landlock/Kconfig"
>>  
>>  source "security/integrity/Kconfig"
>>  
>> diff --git a/security/Makefile b/security/Makefile
>> index 746438499029..2472ef96d40a 100644
>> --- a/security/Makefile
>> +++ b/security/Makefile
>> @@ -12,6 +12,7 @@ subdir-$(CONFIG_SECURITY_YAMA)		+= yama
>>  subdir-$(CONFIG_SECURITY_LOADPIN)	+= loadpin
>>  subdir-$(CONFIG_SECURITY_SAFESETID)    += safesetid
>>  subdir-$(CONFIG_SECURITY_LOCKDOWN_LSM)	+= lockdown
>> +subdir-$(CONFIG_SECURITY_LANDLOCK)		+= landlock
>>  
>>  # always enable default capabilities
>>  obj-y					+= commoncap.o
>> @@ -29,6 +30,7 @@ obj-$(CONFIG_SECURITY_YAMA)		+= yama/
>>  obj-$(CONFIG_SECURITY_LOADPIN)		+= loadpin/
>>  obj-$(CONFIG_SECURITY_SAFESETID)       += safesetid/
>>  obj-$(CONFIG_SECURITY_LOCKDOWN_LSM)	+= lockdown/
>> +obj-$(CONFIG_SECURITY_LANDLOCK)	+= landlock/
>>  obj-$(CONFIG_CGROUP_DEVICE)		+= device_cgroup.o
>>  
>>  # Object integrity file lists
>> diff --git a/security/landlock/Kconfig b/security/landlock/Kconfig
>> new file mode 100644
>> index 000000000000..4a321d5b3f67
>> --- /dev/null
>> +++ b/security/landlock/Kconfig
>> @@ -0,0 +1,15 @@
>> +# SPDX-License-Identifier: GPL-2.0-only
>> +
>> +config SECURITY_LANDLOCK
>> +	bool "Landlock support"
>> +	depends on SECURITY
>> +	default n
>> +	help
>> +	  This selects Landlock, a safe sandboxing mechanism.  It enables to
>> +	  restrict processes on the fly (i.e. enforce an access control policy),
>> +	  which can complement seccomp-bpf.  The security policy is a set of access
>> +	  rights tied to an object, which could be a file, a socket or a process.
>> +
>> +	  See Documentation/security/landlock/ for further information.
>> +
>> +	  If you are unsure how to answer this question, answer N.
>> diff --git a/security/landlock/Makefile b/security/landlock/Makefile
>> new file mode 100644
>> index 000000000000..cb6deefbf4c0
>> --- /dev/null
>> +++ b/security/landlock/Makefile
>> @@ -0,0 +1,3 @@
>> +obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
>> +
>> +landlock-y := object.o
>> diff --git a/security/landlock/object.c b/security/landlock/object.c
>> new file mode 100644
>> index 000000000000..38fbbb108120
>> --- /dev/null
>> +++ b/security/landlock/object.c
>> @@ -0,0 +1,339 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Landlock LSM - Object and rule management
>> + *
>> + * Copyright © 2016-2020 Mickaël Salaün <mic@digikod.net>
>> + * Copyright © 2018-2020 ANSSI
>> + *
>> + * Principles and constraints of the object and rule management:
>> + * - Do not leak memory.
>> + * - Try as much as possible to free a memory allocation as soon as it is
>> + *   unused.
>> + * - Do not use global lock.
>> + * - Do not charge processes other than the one requesting a Landlock
>> + *   operation.
>> + */
>> +
>> +#include <linux/bug.h>
>> +#include <linux/compiler.h>
>> +#include <linux/compiler_types.h>
>> +#include <linux/err.h>
>> +#include <linux/errno.h>
>> +#include <linux/fs.h>
>> +#include <linux/kernel.h>
>> +#include <linux/list.h>
>> +#include <linux/rbtree.h>
>> +#include <linux/rcupdate.h>
>> +#include <linux/refcount.h>
>> +#include <linux/slab.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/workqueue.h>
>> +
>> +#include "object.h"
>> +
>> +struct landlock_object *landlock_create_object(
>> +		const enum landlock_object_type type, void *underlying_object)
>> +{
>> +	struct landlock_object *object;
>> +
>> +	if (WARN_ON_ONCE(!underlying_object))
>> +		return NULL;
>> +	object = kzalloc(sizeof(*object), GFP_KERNEL);
>> +	if (!object)
>> +		return NULL;
>> +	refcount_set(&object->usage, 1);
>> +	refcount_set(&object->cleaners, 1);
>> +	spin_lock_init(&object->lock);
>> +	INIT_LIST_HEAD(&object->rules);
>> +	object->type = type;
>> +	WRITE_ONCE(object->underlying_object, underlying_object);
>> +	return object;
>> +}
>> +
>> +struct landlock_object *landlock_get_object(struct landlock_object *object)
>> +	__acquires(object->usage)
>> +{
>> +	__acquire(object->usage);
>> +	/*
>> +	 * If @object->usage equal 0, then it will be ignored by writers, and
>> +	 * underlying_object->object may be replaced, but this is not an issue
>> +	 * for release_object().
>> +	 */
>> +	if (object && refcount_inc_not_zero(&object->usage)) {
>> +		/*
>> +		 * It should not be possible to get a reference to an object if
>> +		 * its underlying object is being terminated (e.g. with
>> +		 * landlock_release_object()), because an object is only
>> +		 * modifiable through such underlying object.  This is not the
>> +		 * case with landlock_get_object_cleaner().
>> +		 */
>> +		WARN_ON_ONCE(!READ_ONCE(object->underlying_object));
>> +		return object;
>> +	}
>> +	return NULL;
>> +}
>> +
>> +static struct landlock_object *get_object_cleaner(
>> +		struct landlock_object *object)
>> +	__acquires(object->cleaners)
>> +{
>> +	__acquire(object->cleaners);
>> +	if (object && refcount_inc_not_zero(&object->cleaners))
>> +		return object;
>> +	return NULL;
>> +}
>> +
>> +/*
>> + * There is two cases when an object should be free and the reference to the
>> + * underlying object should be put:
>> + * - when the last rule tied to this object is removed, which is handled by
>> + *   landlock_put_rule() and then release_object();
>> + * - when the object is being terminated (e.g. no more reference to an inode),
>> + *   which is handled by landlock_put_object().
>> + */
>> +static void put_object_free(struct landlock_object *object)
>> +	__releases(object->cleaners)
>> +{
>> +	__release(object->cleaners);
>> +	if (!refcount_dec_and_test(&object->cleaners))
>> +		return;
>> +	WARN_ON_ONCE(refcount_read(&object->usage));
>> +	/*
>> +	 * Ensures a safe use of @object in the RCU block from
>> +	 * landlock_put_rule().
>> +	 */
>> +	kfree_rcu(object, rcu_free);
>> +}
>> +
>> +/*
>> + * Destroys a newly created and useless object.
>> + */
>> +void landlock_drop_object(struct landlock_object *object)
>> +{
>> +	if (WARN_ON_ONCE(!refcount_dec_and_test(&object->usage)))
>> +		return;
>> +	__acquire(object->cleaners);
>> +	put_object_free(object);
>> +}
>> +
>> +/*
>> + * Puts the underlying object (e.g. inode) if it is the first request to
>> + * release @object, without calling landlock_put_object().
>> + *
>> + * Return true if this call effectively marks @object as released, false
>> + * otherwise.
>> + */
>> +static bool release_object(struct landlock_object *object)
>> +	__releases(&object->lock)
>> +{
>> +	void *underlying_object;
>> +
>> +	lockdep_assert_held(&object->lock);
>> +
>> +	underlying_object = xchg(&object->underlying_object, NULL);
> 
> A one-line comment looks needed for xchg.

Ok. This is to have a guarantee that the underlying_object (e.g. the
inode pointer) is only used once. I'll add a comment.

> 
>> +	spin_unlock(&object->lock);
>> +	might_sleep();
> 
> Have trouble working out what might_sleep is put for.

Patch 5 adds a call to landlock_release_inode(underlying_object, object)
(LANDLOCK_OBJECT_INODE case), which can sleep e.g., with a call to iput().

> 
>> +	if (!underlying_object)
>> +		return false;
>> +
>> +	switch (object->type) {
>> +	case LANDLOCK_OBJECT_INODE:
>> +		break;
>> +	default:
>> +		WARN_ON_ONCE(1);
>> +	}
>> +	return true;
>> +}
>> +
>> +static void put_object_cleaner(struct landlock_object *object)
>> +	__releases(object->cleaners)
>> +{
>> +	/* Let's try an early lockless check. */
>> +	if (list_empty(&object->rules) &&
>> +			READ_ONCE(object->underlying_object)) {
>> +		/*
>> +		 * Puts @object if there is no rule tied to it and the
>> +		 * remaining user is the underlying object.  This check is
>> +		 * atomic because @object->rules and @object->underlying_object
>> +		 * are protected by @object->lock.
>> +		 */
>> +		spin_lock(&object->lock);
>> +		if (list_empty(&object->rules) &&
>> +				READ_ONCE(object->underlying_object) &&
>> +				refcount_dec_if_one(&object->usage)) {
>> +			/*
>> +			 * Releases @object, in place of
>> +			 * landlock_release_object().
>> +			 *
>> +			 * @object is already empty, implying that all its
>> +			 * previous rules are already disabled.
>> +			 *
>> +			 * Unbalance the @object->cleaners counter to reflect
>> +			 * the underlying object release.
>> +			 */
>> +			if (!WARN_ON_ONCE(!release_object(object))) {
> 
> Two ! hurt more than help.

Well, it may not look nice but don't you think it is better than a
WARN_ON_ONCE(1) in the if block?

>> +				__acquire(object->cleaners);
>> +				put_object_free(object);
> 
> Why put object more than once?

I just replied to Jann about this subject. This is to "unbalance" the
counter to potentially free it (if there is no more user). I explain it
here:
https://lore.kernel.org/lkml/67465638-e22c-5d1a-df37-862b31d999a1@digikod.net/

> 
>> +			}
>> +		} else {
>> +			spin_unlock(&object->lock);
>> +		}
>> +	}
>> +	put_object_free(object);
>> +}
>> +
>
Jann Horn March 9, 2020, 11:44 p.m. UTC | #5
On Mon, Feb 24, 2020 at 5:03 PM Mickaël Salaün <mic@digikod.net> wrote:
> This new version of Landlock is a major revamp of the previous series
> [1], hence the RFC tag.  The three main changes are the replacement of
> eBPF with a dedicated safe management of access rules, the replacement
> of the use of seccomp(2) with a dedicated syscall, and the management of
> filesystem access-control (back from the v10).
>
> As discussed in [2], eBPF may be too powerful and dangerous to be put in
> the hand of unprivileged and potentially malicious processes, especially
> because of side-channel attacks against access-controls or other parts
> of the kernel.
>
> Thanks to this new implementation (1540 SLOC), designed from the ground
> to be used by unprivileged processes, this series enables a process to
> sandbox itself without requiring CAP_SYS_ADMIN, but only the
> no_new_privs constraint (like seccomp).  Not relying on eBPF also
> enables to improve performances, especially for stacked security
> policies thanks to mergeable rulesets.
>
> The compiled documentation is available here:
> https://landlock.io/linux-doc/landlock-v14/security/landlock/index.html
>
> This series can be applied on top of v5.6-rc3.  This can be tested with
> CONFIG_SECURITY_LANDLOCK and CONFIG_SAMPLE_LANDLOCK.  This patch series
> can be found in a Git repository here:
> https://github.com/landlock-lsm/linux/commits/landlock-v14
> I would really appreciate constructive comments on the design and the code.

I've looked through the patchset, and I think that it would be
possible to simplify it quite a bit. I have tried to do that (and
compiled-tested it, but not actually tried running it); here's what I
came up with:

https://github.com/thejh/linux/commits/landlock-mod

The three modified patches (patches 1, 2 and 5) are marked with
"[MODIFIED]" in their title. Please take a look - what do you think?
Feel free to integrate my changes into your patches if you think they
make sense.


Apart from simplifying the code, I also found the following issues,
which I have fixed in the modified patches:

put_hierarchy() has to drop a reference on its parent. (However, this
must not recurse, so we have to do it with a loop.)

put_ruleset() is not in an RCU read-side critical section, so as soon
as it calls kfree_rcu(), "freeme" might disappear; but "orig" is in
"freeme", so when the loop tries to find the next element with
rb_next(orig), that can be a UAF.
rbtree_postorder_for_each_entry_safe() exists for dealing with such
issues.

AFAIK the calls to rb_erase() in clean_ruleset() is not safe if
someone is concurrently accessing the rbtree as an RCU reader, because
concurrent rotations can prevent a lookup from succeeding. The
simplest fix is probably to just make any rbtree that has been
installed on a process immutable, and give up on the cleaning -
arguably the memory wastage that can cause is pretty limited. (By the
way, as a future optimization, we might want to turn the rbtree into a
hashtable when installing it?)

The iput() in landlock_release_inode() looks unsafe - you need to
guarantee that even if the deletion of a ruleset races with
generic_shutdown_super(), every iput() for that superblock finishes
before landlock_release_inodes() returns, even if the iput() is
happening in the context of ruleset deletion. This is why
fsnotify_unmount_inodes() has that wait_var_event() at the end.


Aside from those things, there is also a major correctness issue where
I'm not sure how to solve it properly:

Let's say a process installs a filter on itself like this:

struct landlock_attr_ruleset ruleset = { .handled_access_fs =
ACCESS_FS_ROUGHLY_WRITE};
int ruleset_fd = landlock(LANDLOCK_CMD_CREATE_RULESET,
LANDLOCK_OPT_CREATE_RULESET, sizeof(ruleset), &ruleset);
struct landlock_attr_path_beneath path_beneath = {
  .ruleset_fd = ruleset_fd,
  .allowed_access = ACCESS_FS_ROUGHLY_WRITE,
  .parent_fd = open("/tmp/foobar", O_PATH),
};
landlock(LANDLOCK_CMD_ADD_RULE, LANDLOCK_OPT_ADD_RULE_PATH_BENEATH,
sizeof(path_beneath), &path_beneath);
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
struct landlock_attr_enforce attr_enforce = { .ruleset_fd = ruleset_fd };
landlock(LANDLOCK_CMD_ENFORCE_RULESET, LANDLOCK_OPT_ENFORCE_RULESET,
sizeof(attr_enforce), &attr_enforce);

At this point, the process is not supposed to be able to write to
anything outside /tmp/foobar, right? But what happens if the process
does the following next?

struct landlock_attr_ruleset ruleset = { .handled_access_fs =
ACCESS_FS_ROUGHLY_WRITE};
int ruleset_fd = landlock(LANDLOCK_CMD_CREATE_RULESET,
LANDLOCK_OPT_CREATE_RULESET, sizeof(ruleset), &ruleset);
struct landlock_attr_path_beneath path_beneath = {
  .ruleset_fd = ruleset_fd,
  .allowed_access = ACCESS_FS_ROUGHLY_WRITE,
  .parent_fd = open("/", O_PATH),
};
landlock(LANDLOCK_CMD_ADD_RULE, LANDLOCK_OPT_ADD_RULE_PATH_BENEATH,
sizeof(path_beneath), &path_beneath);
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
struct landlock_attr_enforce attr_enforce = { .ruleset_fd = ruleset_fd };
landlock(LANDLOCK_CMD_ENFORCE_RULESET, LANDLOCK_OPT_ENFORCE_RULESET,
sizeof(attr_enforce), &attr_enforce);

As far as I can tell from looking at the source, after this, you will
have write access to the entire filesystem again. I think the idea is
that LANDLOCK_CMD_ENFORCE_RULESET should only let you drop privileges,
not increase them, right?

I think the easy way to fix this would be to add a bitmask to each
rule that says from which ruleset it originally comes, and then let
check_access_path() collect these bitmasks from each rule with OR, and
check at the end whether the resulting bitmask is full - if not, at
least one of the rulesets did not permit the access, and it should be
denied.

But maybe it would make more sense to change how the API works
instead, and get rid of the concept of "merging" two rulesets
together? Instead, we could make the API work like this:

 - LANDLOCK_CMD_CREATE_RULESET gives you a file descriptor whose
->private_data contains a pointer to the old ruleset of the process,
as well as a pointer to a new empty ruleset.
 - LANDLOCK_CMD_ADD_RULE fails if the specified rule would not be
permitted by the old ruleset, then adds the rule to the new ruleset
 - LANDLOCK_CMD_ENFORCE_RULESET fails if the old ruleset pointer in
->private_data doesn't match the current ruleset of the process, then
replaces the old ruleset with the new ruleset.

With this, the new ruleset is guaranteed to be a subset of the old
ruleset because each of the new ruleset's rules is permitted by the
old ruleset. (Unless the directory hierarchy rotates, but in that case
the inaccuracy isn't much worse than what would've been possible
through RCU path walk anyway AFAIK.)

What do you think?
Mickaël Salaün March 11, 2020, 11:38 p.m. UTC | #6
On 10/03/2020 00:44, Jann Horn wrote:
> On Mon, Feb 24, 2020 at 5:03 PM Mickaël Salaün <mic@digikod.net> wrote:
>> This new version of Landlock is a major revamp of the previous series
>> [1], hence the RFC tag.  The three main changes are the replacement of
>> eBPF with a dedicated safe management of access rules, the replacement
>> of the use of seccomp(2) with a dedicated syscall, and the management of
>> filesystem access-control (back from the v10).
>>
>> As discussed in [2], eBPF may be too powerful and dangerous to be put in
>> the hand of unprivileged and potentially malicious processes, especially
>> because of side-channel attacks against access-controls or other parts
>> of the kernel.
>>
>> Thanks to this new implementation (1540 SLOC), designed from the ground
>> to be used by unprivileged processes, this series enables a process to
>> sandbox itself without requiring CAP_SYS_ADMIN, but only the
>> no_new_privs constraint (like seccomp).  Not relying on eBPF also
>> enables to improve performances, especially for stacked security
>> policies thanks to mergeable rulesets.
>>
>> The compiled documentation is available here:
>> https://landlock.io/linux-doc/landlock-v14/security/landlock/index.html
>>
>> This series can be applied on top of v5.6-rc3.  This can be tested with
>> CONFIG_SECURITY_LANDLOCK and CONFIG_SAMPLE_LANDLOCK.  This patch series
>> can be found in a Git repository here:
>> https://github.com/landlock-lsm/linux/commits/landlock-v14
>> I would really appreciate constructive comments on the design and the code.
> 
> I've looked through the patchset, and I think that it would be
> possible to simplify it quite a bit. I have tried to do that (and
> compiled-tested it, but not actually tried running it); here's what I
> came up with:
> 
> https://github.com/thejh/linux/commits/landlock-mod
> 
> The three modified patches (patches 1, 2 and 5) are marked with
> "[MODIFIED]" in their title. Please take a look - what do you think?
> Feel free to integrate my changes into your patches if you think they
> make sense.

Regarding the landlock_release_inodes(), the final wait_var_event() is
indeed needed (as does fsnotify), but why do you use a READ_ONCE() for
landlock_initialized?

I was reluctant to use function pointers but landlock_object_operations
makes a cleaner and more generic interface to manage objects.

Your get_inode_object() is much simpler and easier to understand than
the get_object() and get_cleaner().
The other main change is about the object cross-reference: you entirely
removed it, which means that an object will only be free when there are
no rules using it. This does not free an object when its underlying
object is being terminated. We now only have to worry about the
termination of the parent of an underlying object (e.g. the super-block
of an inode).

However, I think you forgot to increment object->usage in
create_ruleset_elem(). There is also an unused checked_mask variable in
merge_ruleset().

All this removes optimizations that made the code more difficult to
understand. The performance difference is negligible, and I think that
the memory footprint is fine.
These optimizations (and others) could be discussed later. I'm
integrating most of your changes in the next patch series.

Thank you very much for this review and the code.

> 
> 
> Apart from simplifying the code, I also found the following issues,
> which I have fixed in the modified patches:
> 
> put_hierarchy() has to drop a reference on its parent. (However, this
> must not recurse, so we have to do it with a loop.)

Right, fixed.

> 
> put_ruleset() is not in an RCU read-side critical section, so as soon
> as it calls kfree_rcu(), "freeme" might disappear; but "orig" is in
> "freeme", so when the loop tries to find the next element with
> rb_next(orig), that can be a UAF.
> rbtree_postorder_for_each_entry_safe() exists for dealing with such
> issues.

Good catch.

> 
> AFAIK the calls to rb_erase() in clean_ruleset() is not safe if
> someone is concurrently accessing the rbtree as an RCU reader, because
> concurrent rotations can prevent a lookup from succeeding. The
> simplest fix is probably to just make any rbtree that has been
> installed on a process immutable, and give up on the cleaning -
> arguably the memory wastage that can cause is pretty limited.

Yes, let's go for immutable domains.

> (By the
> way, as a future optimization, we might want to turn the rbtree into a
> hashtable when installing it?)

Definitely. This was a previous (private) implementation I did for
domains, but to simplify the code I reused the same type as a ruleset. A
future evolution of Landlock could add back this optimization.

> 
> The iput() in landlock_release_inode() looks unsafe - you need to
> guarantee that even if the deletion of a ruleset races with
> generic_shutdown_super(), every iput() for that superblock finishes
> before landlock_release_inodes() returns, even if the iput() is
> happening in the context of ruleset deletion. This is why
> fsnotify_unmount_inodes() has that wait_var_event() at the end.

Right, much better with that.

> 
> 
> Aside from those things, there is also a major correctness issue where
> I'm not sure how to solve it properly:
> 
> Let's say a process installs a filter on itself like this:
> 
> struct landlock_attr_ruleset ruleset = { .handled_access_fs =
> ACCESS_FS_ROUGHLY_WRITE};
> int ruleset_fd = landlock(LANDLOCK_CMD_CREATE_RULESET,
> LANDLOCK_OPT_CREATE_RULESET, sizeof(ruleset), &ruleset);
> struct landlock_attr_path_beneath path_beneath = {
>   .ruleset_fd = ruleset_fd,
>   .allowed_access = ACCESS_FS_ROUGHLY_WRITE,
>   .parent_fd = open("/tmp/foobar", O_PATH),
> };
> landlock(LANDLOCK_CMD_ADD_RULE, LANDLOCK_OPT_ADD_RULE_PATH_BENEATH,
> sizeof(path_beneath), &path_beneath);
> prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
> struct landlock_attr_enforce attr_enforce = { .ruleset_fd = ruleset_fd };
> landlock(LANDLOCK_CMD_ENFORCE_RULESET, LANDLOCK_OPT_ENFORCE_RULESET,
> sizeof(attr_enforce), &attr_enforce);
> 
> At this point, the process is not supposed to be able to write to
> anything outside /tmp/foobar, right? But what happens if the process
> does the following next?
> 
> struct landlock_attr_ruleset ruleset = { .handled_access_fs =
> ACCESS_FS_ROUGHLY_WRITE};
> int ruleset_fd = landlock(LANDLOCK_CMD_CREATE_RULESET,
> LANDLOCK_OPT_CREATE_RULESET, sizeof(ruleset), &ruleset);
> struct landlock_attr_path_beneath path_beneath = {
>   .ruleset_fd = ruleset_fd,
>   .allowed_access = ACCESS_FS_ROUGHLY_WRITE,
>   .parent_fd = open("/", O_PATH),
> };
> landlock(LANDLOCK_CMD_ADD_RULE, LANDLOCK_OPT_ADD_RULE_PATH_BENEATH,
> sizeof(path_beneath), &path_beneath);
> prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
> struct landlock_attr_enforce attr_enforce = { .ruleset_fd = ruleset_fd };
> landlock(LANDLOCK_CMD_ENFORCE_RULESET, LANDLOCK_OPT_ENFORCE_RULESET,
> sizeof(attr_enforce), &attr_enforce);
> 
> As far as I can tell from looking at the source, after this, you will
> have write access to the entire filesystem again. I think the idea is
> that LANDLOCK_CMD_ENFORCE_RULESET should only let you drop privileges,
> not increase them, right?

There is an additionnal check in syscall.c:get_path_from_fd(): it is
forbidden to add a rule with a path which is not accessible (according
to LANDLOCK_ACCESS_FS_OPEN) thanks to a call to security_file_open(),
but this is definitely not perfect.

> 
> I think the easy way to fix this would be to add a bitmask to each
> rule that says from which ruleset it originally comes, and then let
> check_access_path() collect these bitmasks from each rule with OR, and
> check at the end whether the resulting bitmask is full - if not, at
> least one of the rulesets did not permit the access, and it should be
> denied.
> 
> But maybe it would make more sense to change how the API works
> instead, and get rid of the concept of "merging" two rulesets
> together? Instead, we could make the API work like this:
> 
>  - LANDLOCK_CMD_CREATE_RULESET gives you a file descriptor whose
> ->private_data contains a pointer to the old ruleset of the process,
> as well as a pointer to a new empty ruleset.
>  - LANDLOCK_CMD_ADD_RULE fails if the specified rule would not be
> permitted by the old ruleset, then adds the rule to the new ruleset
>  - LANDLOCK_CMD_ENFORCE_RULESET fails if the old ruleset pointer in
> ->private_data doesn't match the current ruleset of the process, then
> replaces the old ruleset with the new ruleset.
> 
> With this, the new ruleset is guaranteed to be a subset of the old
> ruleset because each of the new ruleset's rules is permitted by the
> old ruleset. (Unless the directory hierarchy rotates, but in that case
> the inaccuracy isn't much worse than what would've been possible
> through RCU path walk anyway AFAIK.)
> 
> What do you think?
> 

I would prefer to add the same checks you described at first (with
check_access_path), but only when creating a new ruleset with
merge_ruleset() (which should probably be renamed). This enables not to
rely on a parent ruleset/domain until the enforcement, which is the case
anyway.
Unfortunately this doesn't work for some cases with bind mounts. Because
check_access_path() goes through one path, another (bind mounted) path
could be illegitimately allowed.
That makes the problem a bit more complicated. A solution may be to keep
track of the hierarchy of each rule (e.g. with a layer/depth number),
and only allow an access request if at least a rule of each layer allow
this access. In this case we also need to correctly handle the case when
rules from different layers are tied to the same object.

I would like Landlock to have "pure" syscalls, in the sense that a
process A (e.g. a daemon) could prepare a ruleset and sends its FD to a
process B which would then be able to use it to sandbox itself. I think
it makes the reasoning clearer not to have a given ruleset (FD) tied to
a domain (i.e. parent ruleset) at first.
Landlock should (as much as possible) return an error if a syscall
argument is invalid, not according to the current access control (which
is not the case currently because of the security_file_open() check).
This means that these additional merge_ruleset() checks should only
affect the new domain/ruleset, but it should not be visible to userspace.

In a future evolution, it may be useful to add a lock/seal command to
deny any additional rule enforcement. However that may be
counter-productive because that enable application developers (e.g. for
a shell) to deny the use of Landlock features to its child processes.
But it would be possible anyway with seccomp-bpf…
Jann Horn March 17, 2020, 4:19 p.m. UTC | #7
On Thu, Mar 12, 2020 at 12:38 AM Mickaël Salaün <mic@digikod.net> wrote:
> On 10/03/2020 00:44, Jann Horn wrote:
> > On Mon, Feb 24, 2020 at 5:03 PM Mickaël Salaün <mic@digikod.net> wrote:
> >> This new version of Landlock is a major revamp of the previous series
> >> [1], hence the RFC tag.  The three main changes are the replacement of
> >> eBPF with a dedicated safe management of access rules, the replacement
> >> of the use of seccomp(2) with a dedicated syscall, and the management of
> >> filesystem access-control (back from the v10).
> >>
> >> As discussed in [2], eBPF may be too powerful and dangerous to be put in
> >> the hand of unprivileged and potentially malicious processes, especially
> >> because of side-channel attacks against access-controls or other parts
> >> of the kernel.
> >>
> >> Thanks to this new implementation (1540 SLOC), designed from the ground
> >> to be used by unprivileged processes, this series enables a process to
> >> sandbox itself without requiring CAP_SYS_ADMIN, but only the
> >> no_new_privs constraint (like seccomp).  Not relying on eBPF also
> >> enables to improve performances, especially for stacked security
> >> policies thanks to mergeable rulesets.
> >>
> >> The compiled documentation is available here:
> >> https://landlock.io/linux-doc/landlock-v14/security/landlock/index.html
> >>
> >> This series can be applied on top of v5.6-rc3.  This can be tested with
> >> CONFIG_SECURITY_LANDLOCK and CONFIG_SAMPLE_LANDLOCK.  This patch series
> >> can be found in a Git repository here:
> >> https://github.com/landlock-lsm/linux/commits/landlock-v14
> >> I would really appreciate constructive comments on the design and the code.
> >
> > I've looked through the patchset, and I think that it would be
> > possible to simplify it quite a bit. I have tried to do that (and
> > compiled-tested it, but not actually tried running it); here's what I
> > came up with:
> >
> > https://github.com/thejh/linux/commits/landlock-mod
> >
> > The three modified patches (patches 1, 2 and 5) are marked with
> > "[MODIFIED]" in their title. Please take a look - what do you think?
> > Feel free to integrate my changes into your patches if you think they
> > make sense.
>
> Regarding the landlock_release_inodes(), the final wait_var_event() is
> indeed needed (as does fsnotify), but why do you use a READ_ONCE() for
> landlock_initialized?

Ah, good point - that READ_ONCE() should be unnecessary.

> The other main change is about the object cross-reference: you entirely
> removed it, which means that an object will only be free when there are
> no rules using it. This does not free an object when its underlying
> object is being terminated. We now only have to worry about the
> termination of the parent of an underlying object (e.g. the super-block
> of an inode).
>
> However, I think you forgot to increment object->usage in
> create_ruleset_elem().

Whoops, you're right.

> There is also an unused checked_mask variable in merge_ruleset().

Oh, yeah, oops.

> All this removes optimizations that made the code more difficult to
> understand. The performance difference is negligible, and I think that
> the memory footprint is fine.
> These optimizations (and others) could be discussed later. I'm
> integrating most of your changes in the next patch series.

:)

> > Aside from those things, there is also a major correctness issue where
> > I'm not sure how to solve it properly:
> >
> > Let's say a process installs a filter on itself like this:
> >
> > struct landlock_attr_ruleset ruleset = { .handled_access_fs =
> > ACCESS_FS_ROUGHLY_WRITE};
> > int ruleset_fd = landlock(LANDLOCK_CMD_CREATE_RULESET,
> > LANDLOCK_OPT_CREATE_RULESET, sizeof(ruleset), &ruleset);
> > struct landlock_attr_path_beneath path_beneath = {
> >   .ruleset_fd = ruleset_fd,
> >   .allowed_access = ACCESS_FS_ROUGHLY_WRITE,
> >   .parent_fd = open("/tmp/foobar", O_PATH),
> > };
> > landlock(LANDLOCK_CMD_ADD_RULE, LANDLOCK_OPT_ADD_RULE_PATH_BENEATH,
> > sizeof(path_beneath), &path_beneath);
> > prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
> > struct landlock_attr_enforce attr_enforce = { .ruleset_fd = ruleset_fd };
> > landlock(LANDLOCK_CMD_ENFORCE_RULESET, LANDLOCK_OPT_ENFORCE_RULESET,
> > sizeof(attr_enforce), &attr_enforce);
> >
> > At this point, the process is not supposed to be able to write to
> > anything outside /tmp/foobar, right? But what happens if the process
> > does the following next?
> >
> > struct landlock_attr_ruleset ruleset = { .handled_access_fs =
> > ACCESS_FS_ROUGHLY_WRITE};
> > int ruleset_fd = landlock(LANDLOCK_CMD_CREATE_RULESET,
> > LANDLOCK_OPT_CREATE_RULESET, sizeof(ruleset), &ruleset);
> > struct landlock_attr_path_beneath path_beneath = {
> >   .ruleset_fd = ruleset_fd,
> >   .allowed_access = ACCESS_FS_ROUGHLY_WRITE,
> >   .parent_fd = open("/", O_PATH),
> > };
> > landlock(LANDLOCK_CMD_ADD_RULE, LANDLOCK_OPT_ADD_RULE_PATH_BENEATH,
> > sizeof(path_beneath), &path_beneath);
> > prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
> > struct landlock_attr_enforce attr_enforce = { .ruleset_fd = ruleset_fd };
> > landlock(LANDLOCK_CMD_ENFORCE_RULESET, LANDLOCK_OPT_ENFORCE_RULESET,
> > sizeof(attr_enforce), &attr_enforce);
> >
> > As far as I can tell from looking at the source, after this, you will
> > have write access to the entire filesystem again. I think the idea is
> > that LANDLOCK_CMD_ENFORCE_RULESET should only let you drop privileges,
> > not increase them, right?
>
> There is an additionnal check in syscall.c:get_path_from_fd(): it is
> forbidden to add a rule with a path which is not accessible (according
> to LANDLOCK_ACCESS_FS_OPEN) thanks to a call to security_file_open(),
> but this is definitely not perfect.

Ah, I missed that.

> > I think the easy way to fix this would be to add a bitmask to each
> > rule that says from which ruleset it originally comes, and then let
> > check_access_path() collect these bitmasks from each rule with OR, and
> > check at the end whether the resulting bitmask is full - if not, at
> > least one of the rulesets did not permit the access, and it should be
> > denied.
> >
> > But maybe it would make more sense to change how the API works
> > instead, and get rid of the concept of "merging" two rulesets
> > together? Instead, we could make the API work like this:
> >
> >  - LANDLOCK_CMD_CREATE_RULESET gives you a file descriptor whose
> > ->private_data contains a pointer to the old ruleset of the process,
> > as well as a pointer to a new empty ruleset.
> >  - LANDLOCK_CMD_ADD_RULE fails if the specified rule would not be
> > permitted by the old ruleset, then adds the rule to the new ruleset
> >  - LANDLOCK_CMD_ENFORCE_RULESET fails if the old ruleset pointer in
> > ->private_data doesn't match the current ruleset of the process, then
> > replaces the old ruleset with the new ruleset.
> >
> > With this, the new ruleset is guaranteed to be a subset of the old
> > ruleset because each of the new ruleset's rules is permitted by the
> > old ruleset. (Unless the directory hierarchy rotates, but in that case
> > the inaccuracy isn't much worse than what would've been possible
> > through RCU path walk anyway AFAIK.)
> >
> > What do you think?
> >
>
> I would prefer to add the same checks you described at first (with
> check_access_path), but only when creating a new ruleset with
> merge_ruleset() (which should probably be renamed). This enables not to
> rely on a parent ruleset/domain until the enforcement, which is the case
> anyway.
> Unfortunately this doesn't work for some cases with bind mounts. Because
> check_access_path() goes through one path, another (bind mounted) path
> could be illegitimately allowed.

Hmm... I'm not sure what you mean. At the moment, landlock doesn't
allow any sandboxed process to change the mount hierarchy, right? Can
you give an example where this would go wrong?

> That makes the problem a bit more complicated. A solution may be to keep
> track of the hierarchy of each rule (e.g. with a layer/depth number),
> and only allow an access request if at least a rule of each layer allow
> this access. In this case we also need to correctly handle the case when
> rules from different layers are tied to the same object.
Mickaël Salaün March 17, 2020, 5:50 p.m. UTC | #8
On 17/03/2020 17:19, Jann Horn wrote:
> On Thu, Mar 12, 2020 at 12:38 AM Mickaël Salaün <mic@digikod.net> wrote:
>> On 10/03/2020 00:44, Jann Horn wrote:
>>> On Mon, Feb 24, 2020 at 5:03 PM Mickaël Salaün <mic@digikod.net> wrote:

[...]

>>> Aside from those things, there is also a major correctness issue where
>>> I'm not sure how to solve it properly:
>>>
>>> Let's say a process installs a filter on itself like this:
>>>
>>> struct landlock_attr_ruleset ruleset = { .handled_access_fs =
>>> ACCESS_FS_ROUGHLY_WRITE};
>>> int ruleset_fd = landlock(LANDLOCK_CMD_CREATE_RULESET,
>>> LANDLOCK_OPT_CREATE_RULESET, sizeof(ruleset), &ruleset);
>>> struct landlock_attr_path_beneath path_beneath = {
>>>   .ruleset_fd = ruleset_fd,
>>>   .allowed_access = ACCESS_FS_ROUGHLY_WRITE,
>>>   .parent_fd = open("/tmp/foobar", O_PATH),
>>> };
>>> landlock(LANDLOCK_CMD_ADD_RULE, LANDLOCK_OPT_ADD_RULE_PATH_BENEATH,
>>> sizeof(path_beneath), &path_beneath);
>>> prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
>>> struct landlock_attr_enforce attr_enforce = { .ruleset_fd = ruleset_fd };
>>> landlock(LANDLOCK_CMD_ENFORCE_RULESET, LANDLOCK_OPT_ENFORCE_RULESET,
>>> sizeof(attr_enforce), &attr_enforce);
>>>
>>> At this point, the process is not supposed to be able to write to
>>> anything outside /tmp/foobar, right? But what happens if the process
>>> does the following next?
>>>
>>> struct landlock_attr_ruleset ruleset = { .handled_access_fs =
>>> ACCESS_FS_ROUGHLY_WRITE};
>>> int ruleset_fd = landlock(LANDLOCK_CMD_CREATE_RULESET,
>>> LANDLOCK_OPT_CREATE_RULESET, sizeof(ruleset), &ruleset);
>>> struct landlock_attr_path_beneath path_beneath = {
>>>   .ruleset_fd = ruleset_fd,
>>>   .allowed_access = ACCESS_FS_ROUGHLY_WRITE,
>>>   .parent_fd = open("/", O_PATH),
>>> };
>>> landlock(LANDLOCK_CMD_ADD_RULE, LANDLOCK_OPT_ADD_RULE_PATH_BENEATH,
>>> sizeof(path_beneath), &path_beneath);
>>> prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
>>> struct landlock_attr_enforce attr_enforce = { .ruleset_fd = ruleset_fd };
>>> landlock(LANDLOCK_CMD_ENFORCE_RULESET, LANDLOCK_OPT_ENFORCE_RULESET,
>>> sizeof(attr_enforce), &attr_enforce);
>>>
>>> As far as I can tell from looking at the source, after this, you will
>>> have write access to the entire filesystem again. I think the idea is
>>> that LANDLOCK_CMD_ENFORCE_RULESET should only let you drop privileges,
>>> not increase them, right?
>>
>> There is an additionnal check in syscall.c:get_path_from_fd(): it is
>> forbidden to add a rule with a path which is not accessible (according
>> to LANDLOCK_ACCESS_FS_OPEN) thanks to a call to security_file_open(),
>> but this is definitely not perfect.
> 
> Ah, I missed that.
> 
>>> I think the easy way to fix this would be to add a bitmask to each
>>> rule that says from which ruleset it originally comes, and then let
>>> check_access_path() collect these bitmasks from each rule with OR, and
>>> check at the end whether the resulting bitmask is full - if not, at
>>> least one of the rulesets did not permit the access, and it should be
>>> denied.
>>>
>>> But maybe it would make more sense to change how the API works
>>> instead, and get rid of the concept of "merging" two rulesets
>>> together? Instead, we could make the API work like this:
>>>
>>>  - LANDLOCK_CMD_CREATE_RULESET gives you a file descriptor whose
>>> ->private_data contains a pointer to the old ruleset of the process,
>>> as well as a pointer to a new empty ruleset.
>>>  - LANDLOCK_CMD_ADD_RULE fails if the specified rule would not be
>>> permitted by the old ruleset, then adds the rule to the new ruleset
>>>  - LANDLOCK_CMD_ENFORCE_RULESET fails if the old ruleset pointer in
>>> ->private_data doesn't match the current ruleset of the process, then
>>> replaces the old ruleset with the new ruleset.
>>>
>>> With this, the new ruleset is guaranteed to be a subset of the old
>>> ruleset because each of the new ruleset's rules is permitted by the
>>> old ruleset. (Unless the directory hierarchy rotates, but in that case
>>> the inaccuracy isn't much worse than what would've been possible
>>> through RCU path walk anyway AFAIK.)
>>>
>>> What do you think?
>>>
>>
>> I would prefer to add the same checks you described at first (with
>> check_access_path), but only when creating a new ruleset with
>> merge_ruleset() (which should probably be renamed). This enables not to
>> rely on a parent ruleset/domain until the enforcement, which is the case
>> anyway.
>> Unfortunately this doesn't work for some cases with bind mounts. Because
>> check_access_path() goes through one path, another (bind mounted) path
>> could be illegitimately allowed.
> 
> Hmm... I'm not sure what you mean. At the moment, landlock doesn't
> allow any sandboxed process to change the mount hierarchy, right? Can
> you give an example where this would go wrong?

Indeed, a Landlocked process must no be able to change its mount
namespace layout. However, bind mounts may already exist.
Let's say a process sandbox itself to only access /a in a read-write
way. Then, this process (or one of its children) add a new restriction
on /a/b to only be able to read this hierarchy. The check at insertion
time would allow this because this access right is a subset of the
access right allowed with the parent directory. However, If /a/b is bind
mounted somewhere else, let's say in /private/b, then the second
enforcement just gave new access rights to this hierarchy too. This is
why it seems risky to rely on a check about the legitimacy of a new
access right when adding it to a ruleset or when enforcing it.


> 
>> That makes the problem a bit more complicated. A solution may be to keep
>> track of the hierarchy of each rule (e.g. with a layer/depth number),
>> and only allow an access request if at least a rule of each layer allow
>> this access. In this case we also need to correctly handle the case when
>> rules from different layers are tied to the same object.
>
Jann Horn March 17, 2020, 7:45 p.m. UTC | #9
On Tue, Mar 17, 2020 at 6:50 PM Mickaël Salaün <mic@digikod.net> wrote:
> On 17/03/2020 17:19, Jann Horn wrote:
> > On Thu, Mar 12, 2020 at 12:38 AM Mickaël Salaün <mic@digikod.net> wrote:
> >> On 10/03/2020 00:44, Jann Horn wrote:
> >>> On Mon, Feb 24, 2020 at 5:03 PM Mickaël Salaün <mic@digikod.net> wrote:
>
> [...]
>
> >>> Aside from those things, there is also a major correctness issue where
> >>> I'm not sure how to solve it properly:
> >>>
> >>> Let's say a process installs a filter on itself like this:
> >>>
> >>> struct landlock_attr_ruleset ruleset = { .handled_access_fs =
> >>> ACCESS_FS_ROUGHLY_WRITE};
> >>> int ruleset_fd = landlock(LANDLOCK_CMD_CREATE_RULESET,
> >>> LANDLOCK_OPT_CREATE_RULESET, sizeof(ruleset), &ruleset);
> >>> struct landlock_attr_path_beneath path_beneath = {
> >>>   .ruleset_fd = ruleset_fd,
> >>>   .allowed_access = ACCESS_FS_ROUGHLY_WRITE,
> >>>   .parent_fd = open("/tmp/foobar", O_PATH),
> >>> };
> >>> landlock(LANDLOCK_CMD_ADD_RULE, LANDLOCK_OPT_ADD_RULE_PATH_BENEATH,
> >>> sizeof(path_beneath), &path_beneath);
> >>> prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
> >>> struct landlock_attr_enforce attr_enforce = { .ruleset_fd = ruleset_fd };
> >>> landlock(LANDLOCK_CMD_ENFORCE_RULESET, LANDLOCK_OPT_ENFORCE_RULESET,
> >>> sizeof(attr_enforce), &attr_enforce);
> >>>
> >>> At this point, the process is not supposed to be able to write to
> >>> anything outside /tmp/foobar, right? But what happens if the process
> >>> does the following next?
> >>>
> >>> struct landlock_attr_ruleset ruleset = { .handled_access_fs =
> >>> ACCESS_FS_ROUGHLY_WRITE};
> >>> int ruleset_fd = landlock(LANDLOCK_CMD_CREATE_RULESET,
> >>> LANDLOCK_OPT_CREATE_RULESET, sizeof(ruleset), &ruleset);
> >>> struct landlock_attr_path_beneath path_beneath = {
> >>>   .ruleset_fd = ruleset_fd,
> >>>   .allowed_access = ACCESS_FS_ROUGHLY_WRITE,
> >>>   .parent_fd = open("/", O_PATH),
> >>> };
> >>> landlock(LANDLOCK_CMD_ADD_RULE, LANDLOCK_OPT_ADD_RULE_PATH_BENEATH,
> >>> sizeof(path_beneath), &path_beneath);
> >>> prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
> >>> struct landlock_attr_enforce attr_enforce = { .ruleset_fd = ruleset_fd };
> >>> landlock(LANDLOCK_CMD_ENFORCE_RULESET, LANDLOCK_OPT_ENFORCE_RULESET,
> >>> sizeof(attr_enforce), &attr_enforce);
> >>>
> >>> As far as I can tell from looking at the source, after this, you will
> >>> have write access to the entire filesystem again. I think the idea is
> >>> that LANDLOCK_CMD_ENFORCE_RULESET should only let you drop privileges,
> >>> not increase them, right?
> >>
> >> There is an additionnal check in syscall.c:get_path_from_fd(): it is
> >> forbidden to add a rule with a path which is not accessible (according
> >> to LANDLOCK_ACCESS_FS_OPEN) thanks to a call to security_file_open(),
> >> but this is definitely not perfect.
> >
> > Ah, I missed that.
> >
> >>> I think the easy way to fix this would be to add a bitmask to each
> >>> rule that says from which ruleset it originally comes, and then let
> >>> check_access_path() collect these bitmasks from each rule with OR, and
> >>> check at the end whether the resulting bitmask is full - if not, at
> >>> least one of the rulesets did not permit the access, and it should be
> >>> denied.
> >>>
> >>> But maybe it would make more sense to change how the API works
> >>> instead, and get rid of the concept of "merging" two rulesets
> >>> together? Instead, we could make the API work like this:
> >>>
> >>>  - LANDLOCK_CMD_CREATE_RULESET gives you a file descriptor whose
> >>> ->private_data contains a pointer to the old ruleset of the process,
> >>> as well as a pointer to a new empty ruleset.
> >>>  - LANDLOCK_CMD_ADD_RULE fails if the specified rule would not be
> >>> permitted by the old ruleset, then adds the rule to the new ruleset
> >>>  - LANDLOCK_CMD_ENFORCE_RULESET fails if the old ruleset pointer in
> >>> ->private_data doesn't match the current ruleset of the process, then
> >>> replaces the old ruleset with the new ruleset.
> >>>
> >>> With this, the new ruleset is guaranteed to be a subset of the old
> >>> ruleset because each of the new ruleset's rules is permitted by the
> >>> old ruleset. (Unless the directory hierarchy rotates, but in that case
> >>> the inaccuracy isn't much worse than what would've been possible
> >>> through RCU path walk anyway AFAIK.)
> >>>
> >>> What do you think?
> >>>
> >>
> >> I would prefer to add the same checks you described at first (with
> >> check_access_path), but only when creating a new ruleset with
> >> merge_ruleset() (which should probably be renamed). This enables not to
> >> rely on a parent ruleset/domain until the enforcement, which is the case
> >> anyway.
> >> Unfortunately this doesn't work for some cases with bind mounts. Because
> >> check_access_path() goes through one path, another (bind mounted) path
> >> could be illegitimately allowed.
> >
> > Hmm... I'm not sure what you mean. At the moment, landlock doesn't
> > allow any sandboxed process to change the mount hierarchy, right? Can
> > you give an example where this would go wrong?
>
> Indeed, a Landlocked process must no be able to change its mount
> namespace layout. However, bind mounts may already exist.
> Let's say a process sandbox itself to only access /a in a read-write
> way.

So, first policy:

/a RW

> Then, this process (or one of its children) add a new restriction
> on /a/b to only be able to read this hierarchy.

You mean with the second policy looking like this?

/a RW
/a/b R

Then the resulting policy would be:

/a RW policy_bitmask=0x00000003 (bits 0 and 1 set)
/a/b R policy_bitmask=0x00000002 (bit 1 set)
required_bits=0x00000003 (bits 0 and 1 set)

> The check at insertion
> time would allow this because this access right is a subset of the
> access right allowed with the parent directory. However, If /a/b is bind
> mounted somewhere else, let's say in /private/b, then the second
> enforcement just gave new access rights to this hierarchy too.

But with the solution I proposed, landlock's path walk would see
something like this when accessing a file at /private/b/foo:
/private/b/foo <no rules>
  policies seen until now: 0x00000000
/private/b <access: R, policy_bitmask=0x00000002>
  policies seen until now: 0x00000002
/private <no rules>
  policies seen until now: 0x00000002
/ <no rules>
  policies seen until now: 0x00000002

It wouldn't encounter any rule from the first policy, so the OR of the
seen policy bitmasks would be 0x00000002, which is not the required
value 0x00000003, and so the access would be denied.
Mickaël Salaün March 18, 2020, 12:06 p.m. UTC | #10
On 17/03/2020 20:45, Jann Horn wrote:
> On Tue, Mar 17, 2020 at 6:50 PM Mickaël Salaün <mic@digikod.net> wrote:
>> On 17/03/2020 17:19, Jann Horn wrote:
>>> On Thu, Mar 12, 2020 at 12:38 AM Mickaël Salaün <mic@digikod.net> wrote:
>>>> On 10/03/2020 00:44, Jann Horn wrote:
>>>>> On Mon, Feb 24, 2020 at 5:03 PM Mickaël Salaün <mic@digikod.net> wrote:
>>
>> [...]
>>
>>>>> Aside from those things, there is also a major correctness issue where
>>>>> I'm not sure how to solve it properly:
>>>>>
>>>>> Let's say a process installs a filter on itself like this:
>>>>>
>>>>> struct landlock_attr_ruleset ruleset = { .handled_access_fs =
>>>>> ACCESS_FS_ROUGHLY_WRITE};
>>>>> int ruleset_fd = landlock(LANDLOCK_CMD_CREATE_RULESET,
>>>>> LANDLOCK_OPT_CREATE_RULESET, sizeof(ruleset), &ruleset);
>>>>> struct landlock_attr_path_beneath path_beneath = {
>>>>>   .ruleset_fd = ruleset_fd,
>>>>>   .allowed_access = ACCESS_FS_ROUGHLY_WRITE,
>>>>>   .parent_fd = open("/tmp/foobar", O_PATH),
>>>>> };
>>>>> landlock(LANDLOCK_CMD_ADD_RULE, LANDLOCK_OPT_ADD_RULE_PATH_BENEATH,
>>>>> sizeof(path_beneath), &path_beneath);
>>>>> prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
>>>>> struct landlock_attr_enforce attr_enforce = { .ruleset_fd = ruleset_fd };
>>>>> landlock(LANDLOCK_CMD_ENFORCE_RULESET, LANDLOCK_OPT_ENFORCE_RULESET,
>>>>> sizeof(attr_enforce), &attr_enforce);
>>>>>
>>>>> At this point, the process is not supposed to be able to write to
>>>>> anything outside /tmp/foobar, right? But what happens if the process
>>>>> does the following next?
>>>>>
>>>>> struct landlock_attr_ruleset ruleset = { .handled_access_fs =
>>>>> ACCESS_FS_ROUGHLY_WRITE};
>>>>> int ruleset_fd = landlock(LANDLOCK_CMD_CREATE_RULESET,
>>>>> LANDLOCK_OPT_CREATE_RULESET, sizeof(ruleset), &ruleset);
>>>>> struct landlock_attr_path_beneath path_beneath = {
>>>>>   .ruleset_fd = ruleset_fd,
>>>>>   .allowed_access = ACCESS_FS_ROUGHLY_WRITE,
>>>>>   .parent_fd = open("/", O_PATH),
>>>>> };
>>>>> landlock(LANDLOCK_CMD_ADD_RULE, LANDLOCK_OPT_ADD_RULE_PATH_BENEATH,
>>>>> sizeof(path_beneath), &path_beneath);
>>>>> prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
>>>>> struct landlock_attr_enforce attr_enforce = { .ruleset_fd = ruleset_fd };
>>>>> landlock(LANDLOCK_CMD_ENFORCE_RULESET, LANDLOCK_OPT_ENFORCE_RULESET,
>>>>> sizeof(attr_enforce), &attr_enforce);
>>>>>
>>>>> As far as I can tell from looking at the source, after this, you will
>>>>> have write access to the entire filesystem again. I think the idea is
>>>>> that LANDLOCK_CMD_ENFORCE_RULESET should only let you drop privileges,
>>>>> not increase them, right?
>>>>
>>>> There is an additionnal check in syscall.c:get_path_from_fd(): it is
>>>> forbidden to add a rule with a path which is not accessible (according
>>>> to LANDLOCK_ACCESS_FS_OPEN) thanks to a call to security_file_open(),
>>>> but this is definitely not perfect.
>>>
>>> Ah, I missed that.
>>>
>>>>> I think the easy way to fix this would be to add a bitmask to each
>>>>> rule that says from which ruleset it originally comes, and then let
>>>>> check_access_path() collect these bitmasks from each rule with OR, and
>>>>> check at the end whether the resulting bitmask is full - if not, at
>>>>> least one of the rulesets did not permit the access, and it should be
>>>>> denied.
>>>>>
>>>>> But maybe it would make more sense to change how the API works
>>>>> instead, and get rid of the concept of "merging" two rulesets
>>>>> together? Instead, we could make the API work like this:
>>>>>
>>>>>  - LANDLOCK_CMD_CREATE_RULESET gives you a file descriptor whose
>>>>> ->private_data contains a pointer to the old ruleset of the process,
>>>>> as well as a pointer to a new empty ruleset.
>>>>>  - LANDLOCK_CMD_ADD_RULE fails if the specified rule would not be
>>>>> permitted by the old ruleset, then adds the rule to the new ruleset
>>>>>  - LANDLOCK_CMD_ENFORCE_RULESET fails if the old ruleset pointer in
>>>>> ->private_data doesn't match the current ruleset of the process, then
>>>>> replaces the old ruleset with the new ruleset.
>>>>>
>>>>> With this, the new ruleset is guaranteed to be a subset of the old
>>>>> ruleset because each of the new ruleset's rules is permitted by the
>>>>> old ruleset. (Unless the directory hierarchy rotates, but in that case
>>>>> the inaccuracy isn't much worse than what would've been possible
>>>>> through RCU path walk anyway AFAIK.)
>>>>>
>>>>> What do you think?
>>>>>
>>>>
>>>> I would prefer to add the same checks you described at first (with
>>>> check_access_path), but only when creating a new ruleset with
>>>> merge_ruleset() (which should probably be renamed). This enables not to
>>>> rely on a parent ruleset/domain until the enforcement, which is the case
>>>> anyway.
>>>> Unfortunately this doesn't work for some cases with bind mounts. Because
>>>> check_access_path() goes through one path, another (bind mounted) path
>>>> could be illegitimately allowed.
>>>
>>> Hmm... I'm not sure what you mean. At the moment, landlock doesn't
>>> allow any sandboxed process to change the mount hierarchy, right? Can
>>> you give an example where this would go wrong?
>>
>> Indeed, a Landlocked process must no be able to change its mount
>> namespace layout. However, bind mounts may already exist.
>> Let's say a process sandbox itself to only access /a in a read-write
>> way.
> 
> So, first policy:
> 
> /a RW
> 
>> Then, this process (or one of its children) add a new restriction
>> on /a/b to only be able to read this hierarchy.
> 
> You mean with the second policy looking like this?

Right.

> 
> /a RW
> /a/b R
> 
> Then the resulting policy would be:
> 
> /a RW policy_bitmask=0x00000003 (bits 0 and 1 set)
> /a/b R policy_bitmask=0x00000002 (bit 1 set)
> required_bits=0x00000003 (bits 0 and 1 set)
> 
>> The check at insertion
>> time would allow this because this access right is a subset of the
>> access right allowed with the parent directory. However, If /a/b is bind
>> mounted somewhere else, let's say in /private/b, then the second
>> enforcement just gave new access rights to this hierarchy too.
> 
> But with the solution I proposed, landlock's path walk would see
> something like this when accessing a file at /private/b/foo:
> /private/b/foo <no rules>
>   policies seen until now: 0x00000000
> /private/b <access: R, policy_bitmask=0x00000002>
>   policies seen until now: 0x00000002
> /private <no rules>
>   policies seen until now: 0x00000002
> / <no rules>
>   policies seen until now: 0x00000002
> 
> It wouldn't encounter any rule from the first policy, so the OR of the
> seen policy bitmasks would be 0x00000002, which is not the required
> value 0x00000003, and so the access would be denied.
As I understand your proposition, we need to build the required_bits
when adding a rule or enforcing/merging a ruleset with a domain. The
issue is that a rule only refers to a struct inode, not a struct path.
For your proposition to work, we would need to walk through the file
path when adding a rule to a ruleset, which means that we need to depend
of the current view of the process (i.e. its mount namespace), and its
Landlock domain. If the required_bits field is set when the ruleset is
merged with the domain, it is not possible anymore to walk through the
corresponding initial file path, which makes the enforcement step too
late to check for such consistency. The important point is that a
ruleset/domain doesn't have a notion of file hierarchy, a ruleset is
only a set of tagged inodes.

I'm not sure I got your proposition right, though. When and how would
you generate the required_bits?

Here is my updated proposition: add a layer level and a depth to each
rule (once enforced/merged with a domain), and a top layer level for a
domain. When enforcing a ruleset (i.e. merging a ruleset into the
current domain), the layer level of a new rule would be the incremented
top layer level. If there is no rule (from this domain) tied to the same
inode, then the depth of the new rule is 1. However, if there is already
a rule tied to the same inode and if this rule's layer level is the
previous top layer level, then the depth and the layer level are both
incremented and the rule is updated with the new access rights (boolean
AND).

The policy looks like this:
domain top_layer=2
/a RW policy_bitmask=0x00000003 layer=1 depth=1
/a/b R policy_bitmask=0x00000002 layer=2 depth=1

The path walk access check walks through all inodes and start with a
layer counter equal to the top layer of the current domain. For each
encountered inode tied to a rule, the access rights are checked and a
new check ensures that the layer of the matching rule is the same as the
counter (this may be a merged ruleset containing rules pertaining to the
same hierarchy, which is fine) or equal to the decremented counter (i.e.
the path walk just reached the underlying layer). If the path walk
encounter a rule with a layer strictly less than the counter minus one,
there is a whole in the layers which means that the ruleset
hierarchy/subset does not match, and the access must be denied.

When accessing a file at /private/b/foo for a read access:
/private/b/foo <no rules>
  allowed_access=unknown layer_counter=2
/private/b <access: R, policy_bitmask=0x00000002, layer=2, depth=1>
  allowed_access=allowed layer_counter=2
/private <no rules>
  allowed_access=allowed layer_counter=2
/ <no rules>
  allowed_access=allowed layer_counter=2

Because the layer_counter didn't reach 1, the access request is then denied.

This proposition enables not to rely on a parent ruleset at first, only
when enforcing/merging a ruleset with a domain. This also solves the
issue with multiple inherited/nested rules on the same inode (in which
case the depth just grows). Moreover, this enables to safely stop the
path walk as soon as we reach the layer 1.

Here is a more complex example. A process sandbox itself with a first rule:
domain top_layer=1
/a RW policy_bitmask=0x00000003 layer=1 depth=1

Then the sandbox process enforces to itself this second (useless) ruleset:
/a/b RW policy_bitmask=0x00000003

The resulting domain is then:
domain top_layer=2
/a RW policy_bitmask=0x00000003 layer=1 depth=1
/a/b RW policy_bitmask=0x00000003 layer=2 depth=1

Then the sandbox process enforces to itself this third ruleset (which
effectively reduces its access):
/a/b R policy_bitmask=0x00000002

The resulting domain is then:
domain top_layer=3
/a RW policy_bitmask=0x00000003 layer=1 depth=1
/a/b R policy_bitmask=0x00000002 layer=3 depth=2

At this time, only /a/b is accessible in a read way. The access rights
on /a are ignored (but still inherited).

Then the sandbox process enforces to itself this fourth ruleset:
/c R policy_bitmask=0x00000002

The resulting domain is then:
domain top_layer=4
/a RW policy_bitmask=0x00000003 layer=1 depth=1
/a/b R policy_bitmask=0x00000002 layer=3 depth=2
/c R policy_bitmask=0x00000002 layer=4 depth=1

Now, every read or write access requests will be denied.

Then the sandbox process enforces to itself this fifth ruleset:
/a R policy_bitmask=0x00000002

Because /a is not in a contiguous underneath layer, the resulting domain
is unchanged (except the top_layer which may be incremented anyway).
Of course, we must check that the top_layer is not overflowing, in which
case an error must be returned to inform userspace that the ruleset
can't be enforced.
Jann Horn March 18, 2020, 11:33 p.m. UTC | #11
On Wed, Mar 18, 2020 at 1:06 PM Mickaël Salaün <mic@digikod.net> wrote:
> On 17/03/2020 20:45, Jann Horn wrote:
> > On Tue, Mar 17, 2020 at 6:50 PM Mickaël Salaün <mic@digikod.net> wrote:
> >> On 17/03/2020 17:19, Jann Horn wrote:
> >>> On Thu, Mar 12, 2020 at 12:38 AM Mickaël Salaün <mic@digikod.net> wrote:
> >>>> On 10/03/2020 00:44, Jann Horn wrote:
> >>>>> On Mon, Feb 24, 2020 at 5:03 PM Mickaël Salaün <mic@digikod.net> wrote:
> >>
> >> [...]
> >>
> >>>>> Aside from those things, there is also a major correctness issue where
> >>>>> I'm not sure how to solve it properly:
> >>>>>
> >>>>> Let's say a process installs a filter on itself like this:
> >>>>>
> >>>>> struct landlock_attr_ruleset ruleset = { .handled_access_fs =
> >>>>> ACCESS_FS_ROUGHLY_WRITE};
> >>>>> int ruleset_fd = landlock(LANDLOCK_CMD_CREATE_RULESET,
> >>>>> LANDLOCK_OPT_CREATE_RULESET, sizeof(ruleset), &ruleset);
> >>>>> struct landlock_attr_path_beneath path_beneath = {
> >>>>>   .ruleset_fd = ruleset_fd,
> >>>>>   .allowed_access = ACCESS_FS_ROUGHLY_WRITE,
> >>>>>   .parent_fd = open("/tmp/foobar", O_PATH),
> >>>>> };
> >>>>> landlock(LANDLOCK_CMD_ADD_RULE, LANDLOCK_OPT_ADD_RULE_PATH_BENEATH,
> >>>>> sizeof(path_beneath), &path_beneath);
> >>>>> prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
> >>>>> struct landlock_attr_enforce attr_enforce = { .ruleset_fd = ruleset_fd };
> >>>>> landlock(LANDLOCK_CMD_ENFORCE_RULESET, LANDLOCK_OPT_ENFORCE_RULESET,
> >>>>> sizeof(attr_enforce), &attr_enforce);
> >>>>>
> >>>>> At this point, the process is not supposed to be able to write to
> >>>>> anything outside /tmp/foobar, right? But what happens if the process
> >>>>> does the following next?
> >>>>>
> >>>>> struct landlock_attr_ruleset ruleset = { .handled_access_fs =
> >>>>> ACCESS_FS_ROUGHLY_WRITE};
> >>>>> int ruleset_fd = landlock(LANDLOCK_CMD_CREATE_RULESET,
> >>>>> LANDLOCK_OPT_CREATE_RULESET, sizeof(ruleset), &ruleset);
> >>>>> struct landlock_attr_path_beneath path_beneath = {
> >>>>>   .ruleset_fd = ruleset_fd,
> >>>>>   .allowed_access = ACCESS_FS_ROUGHLY_WRITE,
> >>>>>   .parent_fd = open("/", O_PATH),
> >>>>> };
> >>>>> landlock(LANDLOCK_CMD_ADD_RULE, LANDLOCK_OPT_ADD_RULE_PATH_BENEATH,
> >>>>> sizeof(path_beneath), &path_beneath);
> >>>>> prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
> >>>>> struct landlock_attr_enforce attr_enforce = { .ruleset_fd = ruleset_fd };
> >>>>> landlock(LANDLOCK_CMD_ENFORCE_RULESET, LANDLOCK_OPT_ENFORCE_RULESET,
> >>>>> sizeof(attr_enforce), &attr_enforce);
> >>>>>
> >>>>> As far as I can tell from looking at the source, after this, you will
> >>>>> have write access to the entire filesystem again. I think the idea is
> >>>>> that LANDLOCK_CMD_ENFORCE_RULESET should only let you drop privileges,
> >>>>> not increase them, right?
> >>>>
> >>>> There is an additionnal check in syscall.c:get_path_from_fd(): it is
> >>>> forbidden to add a rule with a path which is not accessible (according
> >>>> to LANDLOCK_ACCESS_FS_OPEN) thanks to a call to security_file_open(),
> >>>> but this is definitely not perfect.
> >>>
> >>> Ah, I missed that.
> >>>
> >>>>> I think the easy way to fix this would be to add a bitmask to each
> >>>>> rule that says from which ruleset it originally comes, and then let
> >>>>> check_access_path() collect these bitmasks from each rule with OR, and
> >>>>> check at the end whether the resulting bitmask is full - if not, at
> >>>>> least one of the rulesets did not permit the access, and it should be
> >>>>> denied.
> >>>>>
> >>>>> But maybe it would make more sense to change how the API works
> >>>>> instead, and get rid of the concept of "merging" two rulesets
> >>>>> together? Instead, we could make the API work like this:
> >>>>>
> >>>>>  - LANDLOCK_CMD_CREATE_RULESET gives you a file descriptor whose
> >>>>> ->private_data contains a pointer to the old ruleset of the process,
> >>>>> as well as a pointer to a new empty ruleset.
> >>>>>  - LANDLOCK_CMD_ADD_RULE fails if the specified rule would not be
> >>>>> permitted by the old ruleset, then adds the rule to the new ruleset
> >>>>>  - LANDLOCK_CMD_ENFORCE_RULESET fails if the old ruleset pointer in
> >>>>> ->private_data doesn't match the current ruleset of the process, then
> >>>>> replaces the old ruleset with the new ruleset.
> >>>>>
> >>>>> With this, the new ruleset is guaranteed to be a subset of the old
> >>>>> ruleset because each of the new ruleset's rules is permitted by the
> >>>>> old ruleset. (Unless the directory hierarchy rotates, but in that case
> >>>>> the inaccuracy isn't much worse than what would've been possible
> >>>>> through RCU path walk anyway AFAIK.)
> >>>>>
> >>>>> What do you think?
> >>>>>
> >>>>
> >>>> I would prefer to add the same checks you described at first (with
> >>>> check_access_path), but only when creating a new ruleset with
> >>>> merge_ruleset() (which should probably be renamed). This enables not to
> >>>> rely on a parent ruleset/domain until the enforcement, which is the case
> >>>> anyway.
> >>>> Unfortunately this doesn't work for some cases with bind mounts. Because
> >>>> check_access_path() goes through one path, another (bind mounted) path
> >>>> could be illegitimately allowed.
> >>>
> >>> Hmm... I'm not sure what you mean. At the moment, landlock doesn't
> >>> allow any sandboxed process to change the mount hierarchy, right? Can
> >>> you give an example where this would go wrong?
> >>
> >> Indeed, a Landlocked process must no be able to change its mount
> >> namespace layout. However, bind mounts may already exist.
> >> Let's say a process sandbox itself to only access /a in a read-write
> >> way.
> >
> > So, first policy:
> >
> > /a RW
> >
> >> Then, this process (or one of its children) add a new restriction
> >> on /a/b to only be able to read this hierarchy.
> >
> > You mean with the second policy looking like this?
>
> Right.
>
> >
> > /a RW
> > /a/b R
> >
> > Then the resulting policy would be:
> >
> > /a RW policy_bitmask=0x00000003 (bits 0 and 1 set)
> > /a/b R policy_bitmask=0x00000002 (bit 1 set)
> > required_bits=0x00000003 (bits 0 and 1 set)
> >
> >> The check at insertion
> >> time would allow this because this access right is a subset of the
> >> access right allowed with the parent directory. However, If /a/b is bind
> >> mounted somewhere else, let's say in /private/b, then the second
> >> enforcement just gave new access rights to this hierarchy too.
> >
> > But with the solution I proposed, landlock's path walk would see
> > something like this when accessing a file at /private/b/foo:
> > /private/b/foo <no rules>
> >   policies seen until now: 0x00000000
> > /private/b <access: R, policy_bitmask=0x00000002>
> >   policies seen until now: 0x00000002
> > /private <no rules>
> >   policies seen until now: 0x00000002
> > / <no rules>
> >   policies seen until now: 0x00000002
> >
> > It wouldn't encounter any rule from the first policy, so the OR of the
> > seen policy bitmasks would be 0x00000002, which is not the required
> > value 0x00000003, and so the access would be denied.
> As I understand your proposition, we need to build the required_bits
> when adding a rule or enforcing/merging a ruleset with a domain. The
> issue is that a rule only refers to a struct inode, not a struct path.
> For your proposition to work, we would need to walk through the file
> path when adding a rule to a ruleset, which means that we need to depend
> of the current view of the process (i.e. its mount namespace), and its
> Landlock domain.

I don't see why that is necessary. Why would we have to walk the file
path when adding a rule?

> If the required_bits field is set when the ruleset is
> merged with the domain, it is not possible anymore to walk through the
> corresponding initial file path, which makes the enforcement step too
> late to check for such consistency. The important point is that a
> ruleset/domain doesn't have a notion of file hierarchy, a ruleset is
> only a set of tagged inodes.
>
> I'm not sure I got your proposition right, though. When and how would
> you generate the required_bits?

Using your terminology:
A domain is a collection of N layers, which are assigned indices 0..N-1.
For each possible access type, a domain has a bitmask containing N
bits that stores which layers control that access type. (Basically a
per-layer version of fs_access_mask.)
To validate an access, you start by ORing together the bitmasks for
the requested access types; that gives you the required_bits mask,
which lists all layers that want to control the access.
Then you set seen_policy_bits=0, then do the
check_access_path_continue() loop while keeping track of which layers
you've seen with "seen_policy_bits |= access->contributing_policies",
or something like that.
And in the end, you check that seen_policy_bits is a superset of
required_bits - something like `(~seen_policy_bits) & required_bits ==
0`.

AFAICS to create a new domain from a bunch of layers, you wouldn't
have to do any path walking.

> Here is my updated proposition: add a layer level and a depth to each
> rule (once enforced/merged with a domain), and a top layer level for a
> domain. When enforcing a ruleset (i.e. merging a ruleset into the
> current domain), the layer level of a new rule would be the incremented
> top layer level.
> If there is no rule (from this domain) tied to the same
> inode, then the depth of the new rule is 1. However, if there is already
> a rule tied to the same inode and if this rule's layer level is the
> previous top layer level, then the depth and the layer level are both
> incremented and the rule is updated with the new access rights (boolean
> AND).
>
> The policy looks like this:
> domain top_layer=2
> /a RW policy_bitmask=0x00000003 layer=1 depth=1
> /a/b R policy_bitmask=0x00000002 layer=2 depth=1
>
> The path walk access check walks through all inodes and start with a
> layer counter equal to the top layer of the current domain. For each
> encountered inode tied to a rule, the access rights are checked and a
> new check ensures that the layer of the matching rule is the same as the
> counter (this may be a merged ruleset containing rules pertaining to the
> same hierarchy, which is fine) or equal to the decremented counter (i.e.
> the path walk just reached the underlying layer). If the path walk
> encounter a rule with a layer strictly less than the counter minus one,
> there is a whole in the layers which means that the ruleset
> hierarchy/subset does not match, and the access must be denied.
>
> When accessing a file at /private/b/foo for a read access:
> /private/b/foo <no rules>
>   allowed_access=unknown layer_counter=2
> /private/b <access: R, policy_bitmask=0x00000002, layer=2, depth=1>
>   allowed_access=allowed layer_counter=2
> /private <no rules>
>   allowed_access=allowed layer_counter=2
> / <no rules>
>   allowed_access=allowed layer_counter=2
>
> Because the layer_counter didn't reach 1, the access request is then denied.
>
> This proposition enables not to rely on a parent ruleset at first, only
> when enforcing/merging a ruleset with a domain. This also solves the
> issue with multiple inherited/nested rules on the same inode (in which
> case the depth just grows). Moreover, this enables to safely stop the
> path walk as soon as we reach the layer 1.

(FWIW, you could do the same optimization with the seen_policy_bits approach.)

I guess the difference between your proposal and mine is that in my
proposal, the following would work, in effect permitting W access to
/foo/bar/baz (and nothing else)?

first ruleset:
  /foo W
second ruleset:
  /foo/bar/baz W
third ruleset:
  /foo/bar W

whereas in your proposal, IIUC it wouldn't be valid for a new ruleset
to whitelist a superset of what was whitelisted in a previous ruleset?
Mickaël Salaün March 19, 2020, 4:58 p.m. UTC | #12
On 19/03/2020 00:33, Jann Horn wrote:
> On Wed, Mar 18, 2020 at 1:06 PM Mickaël Salaün <mic@digikod.net> wrote:
>> On 17/03/2020 20:45, Jann Horn wrote:
>>> On Tue, Mar 17, 2020 at 6:50 PM Mickaël Salaün <mic@digikod.net> wrote:
>>>> On 17/03/2020 17:19, Jann Horn wrote:
>>>>> On Thu, Mar 12, 2020 at 12:38 AM Mickaël Salaün <mic@digikod.net> wrote:
>>>>>> On 10/03/2020 00:44, Jann Horn wrote:
>>>>>>> On Mon, Feb 24, 2020 at 5:03 PM Mickaël Salaün <mic@digikod.net> wrote:
>>>>
>>>> [...]
>>>>
>>>>>>> Aside from those things, there is also a major correctness issue where
>>>>>>> I'm not sure how to solve it properly:
>>>>>>>
>>>>>>> Let's say a process installs a filter on itself like this:
>>>>>>>
>>>>>>> struct landlock_attr_ruleset ruleset = { .handled_access_fs =
>>>>>>> ACCESS_FS_ROUGHLY_WRITE};
>>>>>>> int ruleset_fd = landlock(LANDLOCK_CMD_CREATE_RULESET,
>>>>>>> LANDLOCK_OPT_CREATE_RULESET, sizeof(ruleset), &ruleset);
>>>>>>> struct landlock_attr_path_beneath path_beneath = {
>>>>>>>   .ruleset_fd = ruleset_fd,
>>>>>>>   .allowed_access = ACCESS_FS_ROUGHLY_WRITE,
>>>>>>>   .parent_fd = open("/tmp/foobar", O_PATH),
>>>>>>> };
>>>>>>> landlock(LANDLOCK_CMD_ADD_RULE, LANDLOCK_OPT_ADD_RULE_PATH_BENEATH,
>>>>>>> sizeof(path_beneath), &path_beneath);
>>>>>>> prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
>>>>>>> struct landlock_attr_enforce attr_enforce = { .ruleset_fd = ruleset_fd };
>>>>>>> landlock(LANDLOCK_CMD_ENFORCE_RULESET, LANDLOCK_OPT_ENFORCE_RULESET,
>>>>>>> sizeof(attr_enforce), &attr_enforce);
>>>>>>>
>>>>>>> At this point, the process is not supposed to be able to write to
>>>>>>> anything outside /tmp/foobar, right? But what happens if the process
>>>>>>> does the following next?
>>>>>>>
>>>>>>> struct landlock_attr_ruleset ruleset = { .handled_access_fs =
>>>>>>> ACCESS_FS_ROUGHLY_WRITE};
>>>>>>> int ruleset_fd = landlock(LANDLOCK_CMD_CREATE_RULESET,
>>>>>>> LANDLOCK_OPT_CREATE_RULESET, sizeof(ruleset), &ruleset);
>>>>>>> struct landlock_attr_path_beneath path_beneath = {
>>>>>>>   .ruleset_fd = ruleset_fd,
>>>>>>>   .allowed_access = ACCESS_FS_ROUGHLY_WRITE,
>>>>>>>   .parent_fd = open("/", O_PATH),
>>>>>>> };
>>>>>>> landlock(LANDLOCK_CMD_ADD_RULE, LANDLOCK_OPT_ADD_RULE_PATH_BENEATH,
>>>>>>> sizeof(path_beneath), &path_beneath);
>>>>>>> prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
>>>>>>> struct landlock_attr_enforce attr_enforce = { .ruleset_fd = ruleset_fd };
>>>>>>> landlock(LANDLOCK_CMD_ENFORCE_RULESET, LANDLOCK_OPT_ENFORCE_RULESET,
>>>>>>> sizeof(attr_enforce), &attr_enforce);
>>>>>>>
>>>>>>> As far as I can tell from looking at the source, after this, you will
>>>>>>> have write access to the entire filesystem again. I think the idea is
>>>>>>> that LANDLOCK_CMD_ENFORCE_RULESET should only let you drop privileges,
>>>>>>> not increase them, right?
>>>>>>
>>>>>> There is an additionnal check in syscall.c:get_path_from_fd(): it is
>>>>>> forbidden to add a rule with a path which is not accessible (according
>>>>>> to LANDLOCK_ACCESS_FS_OPEN) thanks to a call to security_file_open(),
>>>>>> but this is definitely not perfect.
>>>>>
>>>>> Ah, I missed that.
>>>>>
>>>>>>> I think the easy way to fix this would be to add a bitmask to each
>>>>>>> rule that says from which ruleset it originally comes, and then let
>>>>>>> check_access_path() collect these bitmasks from each rule with OR, and
>>>>>>> check at the end whether the resulting bitmask is full - if not, at
>>>>>>> least one of the rulesets did not permit the access, and it should be
>>>>>>> denied.
>>>>>>>
>>>>>>> But maybe it would make more sense to change how the API works
>>>>>>> instead, and get rid of the concept of "merging" two rulesets
>>>>>>> together? Instead, we could make the API work like this:
>>>>>>>
>>>>>>>  - LANDLOCK_CMD_CREATE_RULESET gives you a file descriptor whose
>>>>>>> ->private_data contains a pointer to the old ruleset of the process,
>>>>>>> as well as a pointer to a new empty ruleset.
>>>>>>>  - LANDLOCK_CMD_ADD_RULE fails if the specified rule would not be
>>>>>>> permitted by the old ruleset, then adds the rule to the new ruleset
>>>>>>>  - LANDLOCK_CMD_ENFORCE_RULESET fails if the old ruleset pointer in
>>>>>>> ->private_data doesn't match the current ruleset of the process, then
>>>>>>> replaces the old ruleset with the new ruleset.
>>>>>>>
>>>>>>> With this, the new ruleset is guaranteed to be a subset of the old
>>>>>>> ruleset because each of the new ruleset's rules is permitted by the
>>>>>>> old ruleset. (Unless the directory hierarchy rotates, but in that case
>>>>>>> the inaccuracy isn't much worse than what would've been possible
>>>>>>> through RCU path walk anyway AFAIK.)
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>
>>>>>> I would prefer to add the same checks you described at first (with
>>>>>> check_access_path), but only when creating a new ruleset with
>>>>>> merge_ruleset() (which should probably be renamed). This enables not to
>>>>>> rely on a parent ruleset/domain until the enforcement, which is the case
>>>>>> anyway.
>>>>>> Unfortunately this doesn't work for some cases with bind mounts. Because
>>>>>> check_access_path() goes through one path, another (bind mounted) path
>>>>>> could be illegitimately allowed.
>>>>>
>>>>> Hmm... I'm not sure what you mean. At the moment, landlock doesn't
>>>>> allow any sandboxed process to change the mount hierarchy, right? Can
>>>>> you give an example where this would go wrong?
>>>>
>>>> Indeed, a Landlocked process must no be able to change its mount
>>>> namespace layout. However, bind mounts may already exist.
>>>> Let's say a process sandbox itself to only access /a in a read-write
>>>> way.
>>>
>>> So, first policy:
>>>
>>> /a RW
>>>
>>>> Then, this process (or one of its children) add a new restriction
>>>> on /a/b to only be able to read this hierarchy.
>>>
>>> You mean with the second policy looking like this?
>>
>> Right.
>>
>>>
>>> /a RW
>>> /a/b R
>>>
>>> Then the resulting policy would be:
>>>
>>> /a RW policy_bitmask=0x00000003 (bits 0 and 1 set)
>>> /a/b R policy_bitmask=0x00000002 (bit 1 set)
>>> required_bits=0x00000003 (bits 0 and 1 set)
>>>
>>>> The check at insertion
>>>> time would allow this because this access right is a subset of the
>>>> access right allowed with the parent directory. However, If /a/b is bind
>>>> mounted somewhere else, let's say in /private/b, then the second
>>>> enforcement just gave new access rights to this hierarchy too.
>>>
>>> But with the solution I proposed, landlock's path walk would see
>>> something like this when accessing a file at /private/b/foo:
>>> /private/b/foo <no rules>
>>>   policies seen until now: 0x00000000
>>> /private/b <access: R, policy_bitmask=0x00000002>
>>>   policies seen until now: 0x00000002
>>> /private <no rules>
>>>   policies seen until now: 0x00000002
>>> / <no rules>
>>>   policies seen until now: 0x00000002
>>>
>>> It wouldn't encounter any rule from the first policy, so the OR of the
>>> seen policy bitmasks would be 0x00000002, which is not the required
>>> value 0x00000003, and so the access would be denied.
>> As I understand your proposition, we need to build the required_bits
>> when adding a rule or enforcing/merging a ruleset with a domain. The
>> issue is that a rule only refers to a struct inode, not a struct path.
>> For your proposition to work, we would need to walk through the file
>> path when adding a rule to a ruleset, which means that we need to depend
>> of the current view of the process (i.e. its mount namespace), and its
>> Landlock domain.
> 
> I don't see why that is necessary. Why would we have to walk the file
> path when adding a rule?
> 
>> If the required_bits field is set when the ruleset is
>> merged with the domain, it is not possible anymore to walk through the
>> corresponding initial file path, which makes the enforcement step too
>> late to check for such consistency. The important point is that a
>> ruleset/domain doesn't have a notion of file hierarchy, a ruleset is
>> only a set of tagged inodes.
>>
>> I'm not sure I got your proposition right, though. When and how would
>> you generate the required_bits?
> 
> Using your terminology:
> A domain is a collection of N layers, which are assigned indices 0..N-1.
> For each possible access type, a domain has a bitmask containing N
> bits that stores which layers control that access type. (Basically a
> per-layer version of fs_access_mask.)

OK, so there is a bit for each domain, which means that you get a limit
of, let's say 64 layers? Knowing that each layer can be created by a
standalone application, potentially nested in a bunch of layers, this
seems artificially limiting.

> To validate an access, you start by ORing together the bitmasks for
> the requested access types; that gives you the required_bits mask,
> which lists all layers that want to control the access.
> Then you set seen_policy_bits=0, then do the
> check_access_path_continue() loop while keeping track of which layers
> you've seen with "seen_policy_bits |= access->contributing_policies",
> or something like that.
> And in the end, you check that seen_policy_bits is a superset of
> required_bits - something like `(~seen_policy_bits) & required_bits ==
> 0`.
> 
> AFAICS to create a new domain from a bunch of layers, you wouldn't
> have to do any path walking.

Right, I misunderstood your previous email.

> 
>> Here is my updated proposition: add a layer level and a depth to each
>> rule (once enforced/merged with a domain), and a top layer level for a
>> domain. When enforcing a ruleset (i.e. merging a ruleset into the
>> current domain), the layer level of a new rule would be the incremented
>> top layer level.
>> If there is no rule (from this domain) tied to the same
>> inode, then the depth of the new rule is 1. However, if there is already
>> a rule tied to the same inode and if this rule's layer level is the
>> previous top layer level, then the depth and the layer level are both
>> incremented and the rule is updated with the new access rights (boolean
>> AND).
>>
>> The policy looks like this:
>> domain top_layer=2
>> /a RW policy_bitmask=0x00000003 layer=1 depth=1
>> /a/b R policy_bitmask=0x00000002 layer=2 depth=1
>>
>> The path walk access check walks through all inodes and start with a
>> layer counter equal to the top layer of the current domain. For each
>> encountered inode tied to a rule, the access rights are checked and a
>> new check ensures that the layer of the matching rule is the same as the
>> counter (this may be a merged ruleset containing rules pertaining to the
>> same hierarchy, which is fine) or equal to the decremented counter (i.e.
>> the path walk just reached the underlying layer). If the path walk
>> encounter a rule with a layer strictly less than the counter minus one,
>> there is a whole in the layers which means that the ruleset
>> hierarchy/subset does not match, and the access must be denied.
>>
>> When accessing a file at /private/b/foo for a read access:
>> /private/b/foo <no rules>
>>   allowed_access=unknown layer_counter=2
>> /private/b <access: R, policy_bitmask=0x00000002, layer=2, depth=1>
>>   allowed_access=allowed layer_counter=2
>> /private <no rules>
>>   allowed_access=allowed layer_counter=2
>> / <no rules>
>>   allowed_access=allowed layer_counter=2
>>
>> Because the layer_counter didn't reach 1, the access request is then denied.
>>
>> This proposition enables not to rely on a parent ruleset at first, only
>> when enforcing/merging a ruleset with a domain. This also solves the
>> issue with multiple inherited/nested rules on the same inode (in which
>> case the depth just grows). Moreover, this enables to safely stop the
>> path walk as soon as we reach the layer 1.
> 
> (FWIW, you could do the same optimization with the seen_policy_bits approach.)
> 
> I guess the difference between your proposal and mine is that in my
> proposal, the following would work, in effect permitting W access to
> /foo/bar/baz (and nothing else)?
> 
> first ruleset:
>   /foo W
> second ruleset:
>   /foo/bar/baz W
> third ruleset:
>   /foo/bar W
> 
> whereas in your proposal, IIUC it wouldn't be valid for a new ruleset
> to whitelist a superset of what was whitelisted in a previous ruleset?
> 

This behavior seems dangerous because a process which sandbox itself to
only access /foo/bar W can bypass the restrictions from one of its
parent domains (i.e. only access /foo/bar/baz W). Indeed, each layer is
(most of the time) a different and standalone security policy.

To sum up, the bitmask approach doesn't have the notion of layers
ordering. It is then not possible to check that a rule comes from a
domain which is the direct ancestor of a child's domain. I want each
policy/layer to be really nested in the sense that a process sandboxing
itself can only add more restriction to itself with regard to its parent
domain (and the whole hierarchy). This is a similar approach to
seccomp-bpf (with chained filters), except there is almost no overhead
to nest several policies/layers together because they are flattened.
Using the layer level and depth approach enables to implement this.
Jann Horn March 19, 2020, 9:17 p.m. UTC | #13
On Thu, Mar 19, 2020 at 5:58 PM Mickaël Salaün <mic@digikod.net> wrote:
> On 19/03/2020 00:33, Jann Horn wrote:
> > On Wed, Mar 18, 2020 at 1:06 PM Mickaël Salaün <mic@digikod.net> wrote:
[...]
> >> As I understand your proposition, we need to build the required_bits
> >> when adding a rule or enforcing/merging a ruleset with a domain. The
> >> issue is that a rule only refers to a struct inode, not a struct path.
> >> For your proposition to work, we would need to walk through the file
> >> path when adding a rule to a ruleset, which means that we need to depend
> >> of the current view of the process (i.e. its mount namespace), and its
> >> Landlock domain.
> >
> > I don't see why that is necessary. Why would we have to walk the file
> > path when adding a rule?
> >
> >> If the required_bits field is set when the ruleset is
> >> merged with the domain, it is not possible anymore to walk through the
> >> corresponding initial file path, which makes the enforcement step too
> >> late to check for such consistency. The important point is that a
> >> ruleset/domain doesn't have a notion of file hierarchy, a ruleset is
> >> only a set of tagged inodes.
> >>
> >> I'm not sure I got your proposition right, though. When and how would
> >> you generate the required_bits?
> >
> > Using your terminology:
> > A domain is a collection of N layers, which are assigned indices 0..N-1.
> > For each possible access type, a domain has a bitmask containing N
> > bits that stores which layers control that access type. (Basically a
> > per-layer version of fs_access_mask.)
>
> OK, so there is a bit for each domain, which means that you get a limit
> of, let's say 64 layers? Knowing that each layer can be created by a
> standalone application, potentially nested in a bunch of layers, this
> seems artificially limiting.

Yes, that is a downside of my approach.

> > To validate an access, you start by ORing together the bitmasks for
> > the requested access types; that gives you the required_bits mask,
> > which lists all layers that want to control the access.
> > Then you set seen_policy_bits=0, then do the
> > check_access_path_continue() loop while keeping track of which layers
> > you've seen with "seen_policy_bits |= access->contributing_policies",
> > or something like that.
> > And in the end, you check that seen_policy_bits is a superset of
> > required_bits - something like `(~seen_policy_bits) & required_bits ==
> > 0`.
> >
> > AFAICS to create a new domain from a bunch of layers, you wouldn't
> > have to do any path walking.
>
> Right, I misunderstood your previous email.
>
> >
> >> Here is my updated proposition: add a layer level and a depth to each
> >> rule (once enforced/merged with a domain), and a top layer level for a
> >> domain. When enforcing a ruleset (i.e. merging a ruleset into the
> >> current domain), the layer level of a new rule would be the incremented
> >> top layer level.
> >> If there is no rule (from this domain) tied to the same
> >> inode, then the depth of the new rule is 1. However, if there is already
> >> a rule tied to the same inode and if this rule's layer level is the
> >> previous top layer level, then the depth and the layer level are both
> >> incremented and the rule is updated with the new access rights (boolean
> >> AND).
> >>
> >> The policy looks like this:
> >> domain top_layer=2
> >> /a RW policy_bitmask=0x00000003 layer=1 depth=1
> >> /a/b R policy_bitmask=0x00000002 layer=2 depth=1
> >>
> >> The path walk access check walks through all inodes and start with a
> >> layer counter equal to the top layer of the current domain. For each
> >> encountered inode tied to a rule, the access rights are checked and a
> >> new check ensures that the layer of the matching rule is the same as the
> >> counter (this may be a merged ruleset containing rules pertaining to the
> >> same hierarchy, which is fine) or equal to the decremented counter (i.e.
> >> the path walk just reached the underlying layer). If the path walk
> >> encounter a rule with a layer strictly less than the counter minus one,
> >> there is a whole in the layers which means that the ruleset
> >> hierarchy/subset does not match, and the access must be denied.
> >>
> >> When accessing a file at /private/b/foo for a read access:
> >> /private/b/foo <no rules>
> >>   allowed_access=unknown layer_counter=2
> >> /private/b <access: R, policy_bitmask=0x00000002, layer=2, depth=1>
> >>   allowed_access=allowed layer_counter=2
> >> /private <no rules>
> >>   allowed_access=allowed layer_counter=2
> >> / <no rules>
> >>   allowed_access=allowed layer_counter=2
> >>
> >> Because the layer_counter didn't reach 1, the access request is then denied.
> >>
> >> This proposition enables not to rely on a parent ruleset at first, only
> >> when enforcing/merging a ruleset with a domain. This also solves the
> >> issue with multiple inherited/nested rules on the same inode (in which
> >> case the depth just grows). Moreover, this enables to safely stop the
> >> path walk as soon as we reach the layer 1.
> >
> > (FWIW, you could do the same optimization with the seen_policy_bits approach.)
> >
> > I guess the difference between your proposal and mine is that in my
> > proposal, the following would work, in effect permitting W access to
> > /foo/bar/baz (and nothing else)?
> >
> > first ruleset:
> >   /foo W
> > second ruleset:
> >   /foo/bar/baz W
> > third ruleset:
> >   /foo/bar W
> >
> > whereas in your proposal, IIUC it wouldn't be valid for a new ruleset
> > to whitelist a superset of what was whitelisted in a previous ruleset?
> >
>
> This behavior seems dangerous because a process which sandbox itself to
> only access /foo/bar W can bypass the restrictions from one of its
> parent domains (i.e. only access /foo/bar/baz W). Indeed, each layer is
> (most of the time) a different and standalone security policy.

It isn't actually bypassing the restriction: You still can't actually
access files like /foo/bar/blah, because a path walk from there
doesn't encounter any rules from the second ruleset.

> To sum up, the bitmask approach doesn't have the notion of layers
> ordering. It is then not possible to check that a rule comes from a
> domain which is the direct ancestor of a child's domain. I want each
> policy/layer to be really nested in the sense that a process sandboxing
> itself can only add more restriction to itself with regard to its parent
> domain (and the whole hierarchy). This is a similar approach to
> seccomp-bpf (with chained filters), except there is almost no overhead
> to nest several policies/layers together because they are flattened.
> Using the layer level and depth approach enables to implement this.
Mickaël Salaün March 30, 2020, 6:26 p.m. UTC | #14
On 19/03/2020 22:17, Jann Horn wrote:
> On Thu, Mar 19, 2020 at 5:58 PM Mickaël Salaün <mic@digikod.net> wrote:
>> On 19/03/2020 00:33, Jann Horn wrote:
>>> On Wed, Mar 18, 2020 at 1:06 PM Mickaël Salaün <mic@digikod.net> wrote:
> [...]
>>>> As I understand your proposition, we need to build the required_bits
>>>> when adding a rule or enforcing/merging a ruleset with a domain. The
>>>> issue is that a rule only refers to a struct inode, not a struct path.
>>>> For your proposition to work, we would need to walk through the file
>>>> path when adding a rule to a ruleset, which means that we need to depend
>>>> of the current view of the process (i.e. its mount namespace), and its
>>>> Landlock domain.
>>>
>>> I don't see why that is necessary. Why would we have to walk the file
>>> path when adding a rule?
>>>
>>>> If the required_bits field is set when the ruleset is
>>>> merged with the domain, it is not possible anymore to walk through the
>>>> corresponding initial file path, which makes the enforcement step too
>>>> late to check for such consistency. The important point is that a
>>>> ruleset/domain doesn't have a notion of file hierarchy, a ruleset is
>>>> only a set of tagged inodes.
>>>>
>>>> I'm not sure I got your proposition right, though. When and how would
>>>> you generate the required_bits?
>>>
>>> Using your terminology:
>>> A domain is a collection of N layers, which are assigned indices 0..N-1.
>>> For each possible access type, a domain has a bitmask containing N
>>> bits that stores which layers control that access type. (Basically a
>>> per-layer version of fs_access_mask.)
>>
>> OK, so there is a bit for each domain, which means that you get a limit
>> of, let's say 64 layers? Knowing that each layer can be created by a
>> standalone application, potentially nested in a bunch of layers, this
>> seems artificially limiting.
> 
> Yes, that is a downside of my approach.
> 
>>> To validate an access, you start by ORing together the bitmasks for
>>> the requested access types; that gives you the required_bits mask,
>>> which lists all layers that want to control the access.
>>> Then you set seen_policy_bits=0, then do the
>>> check_access_path_continue() loop while keeping track of which layers
>>> you've seen with "seen_policy_bits |= access->contributing_policies",
>>> or something like that.
>>> And in the end, you check that seen_policy_bits is a superset of
>>> required_bits - something like `(~seen_policy_bits) & required_bits ==
>>> 0`.
>>>
>>> AFAICS to create a new domain from a bunch of layers, you wouldn't
>>> have to do any path walking.
>>
>> Right, I misunderstood your previous email.
>>
>>>
>>>> Here is my updated proposition: add a layer level and a depth to each
>>>> rule (once enforced/merged with a domain), and a top layer level for a
>>>> domain. When enforcing a ruleset (i.e. merging a ruleset into the
>>>> current domain), the layer level of a new rule would be the incremented
>>>> top layer level.
>>>> If there is no rule (from this domain) tied to the same
>>>> inode, then the depth of the new rule is 1. However, if there is already
>>>> a rule tied to the same inode and if this rule's layer level is the
>>>> previous top layer level, then the depth and the layer level are both
>>>> incremented and the rule is updated with the new access rights (boolean
>>>> AND).
>>>>
>>>> The policy looks like this:
>>>> domain top_layer=2
>>>> /a RW policy_bitmask=0x00000003 layer=1 depth=1
>>>> /a/b R policy_bitmask=0x00000002 layer=2 depth=1
>>>>
>>>> The path walk access check walks through all inodes and start with a
>>>> layer counter equal to the top layer of the current domain. For each
>>>> encountered inode tied to a rule, the access rights are checked and a
>>>> new check ensures that the layer of the matching rule is the same as the
>>>> counter (this may be a merged ruleset containing rules pertaining to the
>>>> same hierarchy, which is fine) or equal to the decremented counter (i.e.
>>>> the path walk just reached the underlying layer). If the path walk
>>>> encounter a rule with a layer strictly less than the counter minus one,
>>>> there is a whole in the layers which means that the ruleset
>>>> hierarchy/subset does not match, and the access must be denied.
>>>>
>>>> When accessing a file at /private/b/foo for a read access:
>>>> /private/b/foo <no rules>
>>>>   allowed_access=unknown layer_counter=2
>>>> /private/b <access: R, policy_bitmask=0x00000002, layer=2, depth=1>
>>>>   allowed_access=allowed layer_counter=2
>>>> /private <no rules>
>>>>   allowed_access=allowed layer_counter=2
>>>> / <no rules>
>>>>   allowed_access=allowed layer_counter=2
>>>>
>>>> Because the layer_counter didn't reach 1, the access request is then denied.
>>>>
>>>> This proposition enables not to rely on a parent ruleset at first, only
>>>> when enforcing/merging a ruleset with a domain. This also solves the
>>>> issue with multiple inherited/nested rules on the same inode (in which
>>>> case the depth just grows). Moreover, this enables to safely stop the
>>>> path walk as soon as we reach the layer 1.
>>>
>>> (FWIW, you could do the same optimization with the seen_policy_bits approach.)
>>>
>>> I guess the difference between your proposal and mine is that in my
>>> proposal, the following would work, in effect permitting W access to
>>> /foo/bar/baz (and nothing else)?
>>>
>>> first ruleset:
>>>   /foo W
>>> second ruleset:
>>>   /foo/bar/baz W
>>> third ruleset:
>>>   /foo/bar W
>>>
>>> whereas in your proposal, IIUC it wouldn't be valid for a new ruleset
>>> to whitelist a superset of what was whitelisted in a previous ruleset?
>>>
>>
>> This behavior seems dangerous because a process which sandbox itself to
>> only access /foo/bar W can bypass the restrictions from one of its
>> parent domains (i.e. only access /foo/bar/baz W). Indeed, each layer is
>> (most of the time) a different and standalone security policy.
> 
> It isn't actually bypassing the restriction: You still can't actually
> access files like /foo/bar/blah, because a path walk from there
> doesn't encounter any rules from the second ruleset.

Right, this use case is legitimate, e.g. first giving access to
~/Downloads and then another layer giving access to ~/ (because it
doesn't know about the current restriction).

I think that neither my initial approach nor yours fit well, but I found
a new one inspired from both approaches. The first solution I gave, and
since implemented in the v15 [1], can manage 2^31-1 layers but it only
works when refining a security policy *knowing the parent one* (i.e.
refining an access tied to an inode, not a full file hierarchy). Instead
of having a layer level and a layer depth, my new implementation (for
v16) use a layer bitfield for each rule (and ruleset). We still AND
access rights when merging rulesets, but instead of storing the last
layer lever and depth, we set the corresponding bit in the layer
bitfield of the rule. This way we don't consume more memory than the v15
implementation (for 64 layers top, which would be 64 bits * number of
access types with your approach, i.e. between 1KB and 2KB) and Landlock
can properly manage supersets of access rights in nested hierarchies,
whatever their stacking order. However, I don't see another solution to
better handle more than 64 layers than a VLA, but that could come later.

[1] https://lore.kernel.org/lkml/20200326202731.693608-6-mic@digikod.net/

> 
>> To sum up, the bitmask approach doesn't have the notion of layers
>> ordering. It is then not possible to check that a rule comes from a
>> domain which is the direct ancestor of a child's domain. I want each
>> policy/layer to be really nested in the sense that a process sandboxing
>> itself can only add more restriction to itself with regard to its parent
>> domain (and the whole hierarchy). This is a similar approach to
>> seccomp-bpf (with chained filters), except there is almost no overhead
>> to nest several policies/layers together because they are flattened.
>> Using the layer level and depth approach enables to implement this.
>