diff mbox series

[v6,bpf-next,03/13] bpf: introduce BPF token object

Message ID 20230927225809.2049655-4-andrii@kernel.org (mailing list archive)
State New, archived
Headers show
Series BPF token and BPF FS-based delegation | expand

Commit Message

Andrii Nakryiko Sept. 27, 2023, 10:57 p.m. UTC
Add new kind of BPF kernel object, BPF token. BPF token is meant to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from privileged process to a *trusted*
unprivileged process, all while have a good amount of control over which
privileged operations could be performed using provided BPF token.

This is achieved through mounting BPF FS instance with extra delegation
mount options, which determine what operations are delegatable, and also
constraining it to the owning user namespace (as mentioned in the
previous patch).

BPF token itself is just a derivative from BPF FS and can be created
through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts
a path specification (using the usual fd + string path combo) to a BPF
FS mount. Currently, BPF token "inherits" delegated command, map types,
prog type, and attach type bit sets from BPF FS as is. In the future,
having an BPF token as a separate object with its own FD, we can allow
to further restrict BPF token's allowable set of things either at the creation
time or after the fact, allowing the process to guard itself further
from, e.g., unintentionally trying to load undesired kind of BPF
programs. But for now we keep things simple and just copy bit sets as is.

When BPF token is created from BPF FS mount, we take reference to the
BPF super block's owning user namespace, and then use that namespace for
checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
capabilities that are normally only checked against init userns (using
capable()), but now we check them using ns_capable() instead (if BPF
token is provided). See bpf_token_capable() for details.

Such setup means that BPF token in itself is not sufficient to grant BPF
functionality. User namespaced process has to *also* have necessary
combination of capabilities inside that user namespace. So while
previously CAP_BPF was useless when granted within user namespace, now
it gains a meaning and allows container managers and sys admins to have
a flexible control over which processes can and need to use BPF
functionality within the user namespace (i.e., container in practice).
And BPF FS delegation mount options and derived BPF tokens serve as
a per-container "flag" to grant overall ability to use bpf() (plus further
restrict on which parts of bpf() syscalls are treated as namespaced).

The alternative to creating BPF token object was:
  a) not having any extra object and just pasing BPF FS path to each
     relevant bpf() command. This seems suboptimal as it's racy (mount
     under the same path might change in between checking it and using it
     for bpf() command). And also less flexible if we'd like to further
     restrict ourselves compared to all the delegated functionality
     allowed on BPF FS.
  b) use non-bpf() interface, e.g., ioctl(), but otherwise also create
     a dedicated FD that would represent a token-like functionality. This
     doesn't seem superior to having a proper bpf() command, so
     BPF_TOKEN_CREATE was chosen.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf.h            |  40 +++++++
 include/uapi/linux/bpf.h       |  39 +++++++
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/inode.c             |  10 +-
 kernel/bpf/syscall.c           |  17 +++
 kernel/bpf/token.c             | 197 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  39 +++++++
 7 files changed, 339 insertions(+), 5 deletions(-)
 create mode 100644 kernel/bpf/token.c

Comments

Paul Moore Oct. 11, 2023, 1:17 a.m. UTC | #1
On Sep 27, 2023 Andrii Nakryiko <andrii@kernel.org> wrote:
> 
> Add new kind of BPF kernel object, BPF token. BPF token is meant to
> allow delegating privileged BPF functionality, like loading a BPF
> program or creating a BPF map, from privileged process to a *trusted*
> unprivileged process, all while have a good amount of control over which
> privileged operations could be performed using provided BPF token.
> 
> This is achieved through mounting BPF FS instance with extra delegation
> mount options, which determine what operations are delegatable, and also
> constraining it to the owning user namespace (as mentioned in the
> previous patch).
> 
> BPF token itself is just a derivative from BPF FS and can be created
> through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts
> a path specification (using the usual fd + string path combo) to a BPF
> FS mount. Currently, BPF token "inherits" delegated command, map types,
> prog type, and attach type bit sets from BPF FS as is. In the future,
> having an BPF token as a separate object with its own FD, we can allow
> to further restrict BPF token's allowable set of things either at the creation
> time or after the fact, allowing the process to guard itself further
> from, e.g., unintentionally trying to load undesired kind of BPF
> programs. But for now we keep things simple and just copy bit sets as is.
> 
> When BPF token is created from BPF FS mount, we take reference to the
> BPF super block's owning user namespace, and then use that namespace for
> checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
> capabilities that are normally only checked against init userns (using
> capable()), but now we check them using ns_capable() instead (if BPF
> token is provided). See bpf_token_capable() for details.
> 
> Such setup means that BPF token in itself is not sufficient to grant BPF
> functionality. User namespaced process has to *also* have necessary
> combination of capabilities inside that user namespace. So while
> previously CAP_BPF was useless when granted within user namespace, now
> it gains a meaning and allows container managers and sys admins to have
> a flexible control over which processes can and need to use BPF
> functionality within the user namespace (i.e., container in practice).
> And BPF FS delegation mount options and derived BPF tokens serve as
> a per-container "flag" to grant overall ability to use bpf() (plus further
> restrict on which parts of bpf() syscalls are treated as namespaced).
> 
> The alternative to creating BPF token object was:
>   a) not having any extra object and just pasing BPF FS path to each
>      relevant bpf() command. This seems suboptimal as it's racy (mount
>      under the same path might change in between checking it and using it
>      for bpf() command). And also less flexible if we'd like to further
>      restrict ourselves compared to all the delegated functionality
>      allowed on BPF FS.
>   b) use non-bpf() interface, e.g., ioctl(), but otherwise also create
>      a dedicated FD that would represent a token-like functionality. This
>      doesn't seem superior to having a proper bpf() command, so
>      BPF_TOKEN_CREATE was chosen.
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  include/linux/bpf.h            |  40 +++++++
>  include/uapi/linux/bpf.h       |  39 +++++++
>  kernel/bpf/Makefile            |   2 +-
>  kernel/bpf/inode.c             |  10 +-
>  kernel/bpf/syscall.c           |  17 +++
>  kernel/bpf/token.c             | 197 +++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |  39 +++++++
>  7 files changed, 339 insertions(+), 5 deletions(-)
>  create mode 100644 kernel/bpf/token.c
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index a5bd40f71fd0..c43131a24579 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1572,6 +1576,13 @@ struct bpf_mount_opts {
>  	u64 delegate_attachs;
>  };
>  
> +struct bpf_token {
> +	struct work_struct work;
> +	atomic64_t refcnt;
> +	struct user_namespace *userns;
> +	u64 allowed_cmds;

We'll also need a 'void *security' field to go along with the BPF token
allocation/creation/free hooks, see my comments below.  This is similar
to what we do for other kernel objects.

> +};
> +

...

> diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
> new file mode 100644
> index 000000000000..779aad5007a3
> --- /dev/null
> +++ b/kernel/bpf/token.c
> @@ -0,0 +1,197 @@
> +#include <linux/bpf.h>
> +#include <linux/vmalloc.h>
> +#include <linux/anon_inodes.h>

Probably don't need the anon_inode.h include anymore.

> +#include <linux/fdtable.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/kernel.h>
> +#include <linux/idr.h>
> +#include <linux/namei.h>
> +#include <linux/user_namespace.h>
> +
> +bool bpf_token_capable(const struct bpf_token *token, int cap)
> +{
> +	/* BPF token allows ns_capable() level of capabilities */
> +	if (token) {

I think we want a LSM hook here before the token is used in the
capability check.  The LSM will see the capability check, but it will
not be able to distinguish it from the process which created the
delegation token.  This is arguably the purpose of the delegation, but
with the LSM we want to be able to control who can use the delegated
privilege.  How about something like this:

  if (security_bpf_token_capable(token, cap))
     return false;

> +		if (ns_capable(token->userns, cap))
> +			return true;
> +		if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> +			return true;
> +	}
> +	/* otherwise fallback to capable() checks */
> +	return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> +}
> +
> +void bpf_token_inc(struct bpf_token *token)
> +{
> +	atomic64_inc(&token->refcnt);
> +}
> +
> +static void bpf_token_free(struct bpf_token *token)
> +{

We should have a LSM hook here to handle freeing the LSM state
associated with the token.

  security_bpf_token_free(token);

> +	put_user_ns(token->userns);
> +	kvfree(token);
> +}

...

> +static struct bpf_token *bpf_token_alloc(void)
> +{
> +	struct bpf_token *token;
> +
> +	token = kvzalloc(sizeof(*token), GFP_USER);
> +	if (!token)
> +		return NULL;
> +
> +	atomic64_set(&token->refcnt, 1);

We should have a LSM hook here to allocate the LSM state associated
with the token.

  if (security_bpf_token_alloc(token)) {
    kvfree(token);
    return NULL;
  }

> +	return token;
> +}

...

> +int bpf_token_create(union bpf_attr *attr)
> +{
> +	struct bpf_mount_opts *mnt_opts;
> +	struct bpf_token *token = NULL;
> +	struct inode *inode;
> +	struct file *file;
> +	struct path path;
> +	umode_t mode;
> +	int err, fd;
> +
> +	err = user_path_at(attr->token_create.bpffs_path_fd,
> +			   u64_to_user_ptr(attr->token_create.bpffs_pathname),
> +			   LOOKUP_FOLLOW | LOOKUP_EMPTY, &path);
> +	if (err)
> +		return err;
> +
> +	if (path.mnt->mnt_root != path.dentry) {
> +		err = -EINVAL;
> +		goto out_path;
> +	}
> +	err = path_permission(&path, MAY_ACCESS);
> +	if (err)
> +		goto out_path;
> +
> +	mode = S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask());
> +	inode = bpf_get_inode(path.mnt->mnt_sb, NULL, mode);
> +	if (IS_ERR(inode)) {
> +		err = PTR_ERR(inode);
> +		goto out_path;
> +	}
> +
> +	inode->i_op = &bpf_token_iops;
> +	inode->i_fop = &bpf_token_fops;
> +	clear_nlink(inode); /* make sure it is unlinked */
> +
> +	file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
> +	if (IS_ERR(file)) {
> +		iput(inode);
> +		err = PTR_ERR(file);
> +		goto out_file;
> +	}
> +
> +	token = bpf_token_alloc();
> +	if (!token) {
> +		err = -ENOMEM;
> +		goto out_file;
> +	}
> +
> +	/* remember bpffs owning userns for future ns_capable() checks */
> +	token->userns = get_user_ns(path.dentry->d_sb->s_user_ns);
> +
> +	mnt_opts = path.dentry->d_sb->s_fs_info;
> +	token->allowed_cmds = mnt_opts->delegate_cmds;

I think we would want a LSM hook here, both to control the creation
of the token and mark it with the security attributes of the creating
process.  How about something like this:

  err = security_bpf_token_create(token);
  if (err)
    goto out_token;

> +	fd = get_unused_fd_flags(O_CLOEXEC);
> +	if (fd < 0) {
> +		err = fd;
> +		goto out_token;
> +	}
> +
> +	file->private_data = token;
> +	fd_install(fd, file);
> +
> +	path_put(&path);
> +	return fd;
> +
> +out_token:
> +	bpf_token_free(token);
> +out_file:
> +	fput(file);
> +out_path:
> +	path_put(&path);
> +	return err;
> +}

...

> +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
> +{
> +	if (!token)
> +		return false;
> +
> +	return token->allowed_cmds & (1ULL << cmd);

Similar to bpf_token_capable(), I believe we want a LSM hook here to
control who is allowed to use the delegated privilege.

  bool bpf_token_allow_cmd(...)
  {
    if (token && (token->allowed_cmds & (1ULL << cmd))
      return security_bpf_token_cmd(token, cmd);
    return false;
  }

> +}

--
paul-moore.com
Hou Tao Oct. 11, 2023, 2:35 a.m. UTC | #2
Hi,

On 9/28/2023 6:57 AM, Andrii Nakryiko wrote:
> Add new kind of BPF kernel object, BPF token. BPF token is meant to
> allow delegating privileged BPF functionality, like loading a BPF
> program or creating a BPF map, from privileged process to a *trusted*
> unprivileged process, all while have a good amount of control over which
> privileged operations could be performed using provided BPF token.
>
> This is achieved through mounting BPF FS instance with extra delegation
> mount options, which determine what operations are delegatable, and also
> constraining it to the owning user namespace (as mentioned in the
> previous patch).
SNIP
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 70bfa997e896..78692911f4a0 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -847,6 +847,37 @@ union bpf_iter_link_info {
>   *		Returns zero on success. On error, -1 is returned and *errno*
>   *		is set appropriately.
>   *
> + * BPF_TOKEN_CREATE
> + *	Description
> + *		Create BPF token with embedded information about what
> + *		BPF-related functionality it allows:
> + *		- a set of allowed bpf() syscall commands;
> + *		- a set of allowed BPF map types to be created with
> + *		BPF_MAP_CREATE command, if BPF_MAP_CREATE itself is allowed;
> + *		- a set of allowed BPF program types and BPF program attach
> + *		types to be loaded with BPF_PROG_LOAD command, if
> + *		BPF_PROG_LOAD itself is allowed.
> + *
> + *		BPF token is created (derived) from an instance of BPF FS,
> + *		assuming it has necessary delegation mount options specified.
> + *		BPF FS mount is specified with openat()-style path FD + string.
> + *		This BPF token can be passed as an extra parameter to various
> + *		bpf() syscall commands to grant BPF subsystem functionality to
> + *		unprivileged processes.
> + *
> + *		When created, BPF token is "associated" with the owning
> + *		user namespace of BPF FS instance (super block) that it was
> + *		derived from, and subsequent BPF operations performed with
> + *		BPF token would be performing capabilities checks (i.e.,
> + *		CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN) within
> + *		that user namespace. Without BPF token, such capabilities
> + *		have to be granted in init user namespace, making bpf()
> + *		syscall incompatible with user namespace, for the most part.
> + *
> + *	Return
> + *		A new file descriptor (a nonnegative integer), or -1 if an
> + *		error occurred (in which case, *errno* is set appropriately).
> + *
>   * NOTES
>   *	eBPF objects (maps and programs) can be shared between processes.
>   *
> @@ -901,6 +932,8 @@ enum bpf_cmd {
>  	BPF_ITER_CREATE,
>  	BPF_LINK_DETACH,
>  	BPF_PROG_BIND_MAP,
> +	BPF_TOKEN_CREATE,
> +	__MAX_BPF_CMD,
>  };
>  
>  enum bpf_map_type {
> @@ -1694,6 +1727,12 @@ union bpf_attr {
>  		__u32		flags;		/* extra flags */
>  	} prog_bind_map;
>  
> +	struct { /* struct used by BPF_TOKEN_CREATE command */
> +		__u32		flags;
> +		__u32		bpffs_path_fd;
> +		__u64		bpffs_pathname;

Because bppfs_pathname is a string pointer, so __aligned_u64 is preferred.
> +	} token_create;
> +
>  } __attribute__((aligned(8)));
>  
>  /* The description below is an attempt at providing documentation to eBPF
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index f526b7573e97..4ce95acfcaa7 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -6,7 +6,7 @@ cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse
>  endif
>  CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
>  
> -obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o
> +obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o token.o
>  obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
>  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
>  obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
> diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> index 24b3faf901f4..de1fdf396521 100644
> --- a/kernel/bpf/inode.c
> +++ b/kernel/bpf/inode.c
> @@ -99,9 +99,9 @@ static const struct inode_operations bpf_prog_iops = { };
>  static const struct inode_operations bpf_map_iops  = { };
>  static const struct inode_operations bpf_link_iops  = { };
>  
> -static struct inode *bpf_get_inode(struct super_block *sb,
> -				   const struct inode *dir,
> -				   umode_t mode)
> +struct inode *bpf_get_inode(struct super_block *sb,
> +			    const struct inode *dir,
> +			    umode_t mode)
>  {
>  	struct inode *inode;
>  
> @@ -603,11 +603,13 @@ static int bpf_show_options(struct seq_file *m, struct dentry *root)
>  {
>  	struct bpf_mount_opts *opts = root->d_sb->s_fs_info;
>  	umode_t mode = d_inode(root)->i_mode & S_IALLUGO & ~S_ISVTX;
> +	u64 mask;
>  
>  	if (mode != S_IRWXUGO)
>  		seq_printf(m, ",mode=%o", mode);
>  
> -	if (opts->delegate_cmds == ~0ULL)
> +	mask = (1ULL << __MAX_BPF_CMD) - 1;
> +	if ((opts->delegate_cmds & mask) == mask)
>  		seq_printf(m, ",delegate_cmds=any");

Should we add a BUILD_BUG_ON assertion to guarantee __MAX_BPF_CMD is
less than sizeof(u64) * 8 ?
>  	else if (opts->delegate_cmds)
>  		seq_printf(m, ",delegate_cmds=0x%llx", opts->delegate_cmds);
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 7445dad01fb3..b47791a80930 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -5304,6 +5304,20 @@ static int bpf_prog_bind_map(union bpf_attr *attr)
>  	return ret;
>  }
>  
> +#define BPF_TOKEN_CREATE_LAST_FIELD token_create.bpffs_pathname
> +
> +static int token_create(union bpf_attr *attr)
> +{
> +	if (CHECK_ATTR(BPF_TOKEN_CREATE))
> +		return -EINVAL;
> +
> +	/* no flags are supported yet */
> +	if (attr->token_create.flags)
> +		return -EINVAL;
> +
> +	return bpf_token_create(attr);
> +}
> +
>  static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
>  {
>  	union bpf_attr attr;
> @@ -5437,6 +5451,9 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
>  	case BPF_PROG_BIND_MAP:
>  		err = bpf_prog_bind_map(&attr);
>  		break;
> +	case BPF_TOKEN_CREATE:
> +		err = token_create(&attr);
> +		break;
>  	default:
>  		err = -EINVAL;
>  		break;
> diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
> new file mode 100644
> index 000000000000..779aad5007a3
> --- /dev/null
> +++ b/kernel/bpf/token.c
SNIP
> +#define BPF_TOKEN_INODE_NAME "bpf-token"
> +
> +static const struct inode_operations bpf_token_iops = { };
> +
> +static const struct file_operations bpf_token_fops = {
> +	.release	= bpf_token_release,
> +	.show_fdinfo	= bpf_token_show_fdinfo,
> +};
> +
> +int bpf_token_create(union bpf_attr *attr)
> +{
> +	struct bpf_mount_opts *mnt_opts;
> +	struct bpf_token *token = NULL;
> +	struct inode *inode;
> +	struct file *file;
> +	struct path path;
> +	umode_t mode;
> +	int err, fd;
> +
> +	err = user_path_at(attr->token_create.bpffs_path_fd,
> +			   u64_to_user_ptr(attr->token_create.bpffs_pathname),
> +			   LOOKUP_FOLLOW | LOOKUP_EMPTY, &path);
> +	if (err)
> +		return err;

Need to check the mount is a bpffs mount instead of other filesystem mount.
> +
> +	if (path.mnt->mnt_root != path.dentry) {
> +		err = -EINVAL;
> +		goto out_path;
> +	}
> +	err = path_permission(&path, MAY_ACCESS);
> +	if (err)
> +		goto out_path;
> +
> +	mode = S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask());
> +	inode = bpf_get_inode(path.mnt->mnt_sb, NULL, mode);
> +	if (IS_ERR(inode)) {
> +		err = PTR_ERR(inode);
> +		goto out_path;
> +	}
> +
> +	inode->i_op = &bpf_token_iops;
> +	inode->i_fop = &bpf_token_fops;
> +	clear_nlink(inode); /* make sure it is unlinked */
> +
> +	file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
> +	if (IS_ERR(file)) {
> +		iput(inode);
> +		err = PTR_ERR(file);
> +		goto out_file;

goto out_path ?
> +	}
> +
> +	token = bpf_token_alloc();
> +	if (!token) {
> +		err = -ENOMEM;
> +		goto out_file;
> +	}
> +
> +	/* remember bpffs owning userns for future ns_capable() checks */
> +	token->userns = get_user_ns(path.dentry->d_sb->s_user_ns);
> +
> +	mnt_opts = path.dentry->d_sb->s_fs_info;
> +	token->allowed_cmds = mnt_opts->delegate_cmds;
> +
> +	fd = get_unused_fd_flags(O_CLOEXEC);
> +	if (fd < 0) {
> +		err = fd;
> +		goto out_token;
> +	}
> +
> +	file->private_data = token;
> +	fd_install(fd, file);
> +
> +	path_put(&path);
> +	return fd;
> +
> +out_token:
> +	bpf_token_free(token);
> +out_file:
> +	fput(file);
> +out_path:
> +	path_put(&path);
> +	return err;
> +}
> +
.
Andrii Nakryiko Oct. 12, 2023, 12:31 a.m. UTC | #3
On Tue, Oct 10, 2023 at 7:35 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 9/28/2023 6:57 AM, Andrii Nakryiko wrote:
> > Add new kind of BPF kernel object, BPF token. BPF token is meant to
> > allow delegating privileged BPF functionality, like loading a BPF
> > program or creating a BPF map, from privileged process to a *trusted*
> > unprivileged process, all while have a good amount of control over which
> > privileged operations could be performed using provided BPF token.
> >
> > This is achieved through mounting BPF FS instance with extra delegation
> > mount options, which determine what operations are delegatable, and also
> > constraining it to the owning user namespace (as mentioned in the
> > previous patch).
> SNIP
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 70bfa997e896..78692911f4a0 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -847,6 +847,37 @@ union bpf_iter_link_info {
> >   *           Returns zero on success. On error, -1 is returned and *errno*
> >   *           is set appropriately.
> >   *
> > + * BPF_TOKEN_CREATE
> > + *   Description
> > + *           Create BPF token with embedded information about what
> > + *           BPF-related functionality it allows:
> > + *           - a set of allowed bpf() syscall commands;
> > + *           - a set of allowed BPF map types to be created with
> > + *           BPF_MAP_CREATE command, if BPF_MAP_CREATE itself is allowed;
> > + *           - a set of allowed BPF program types and BPF program attach
> > + *           types to be loaded with BPF_PROG_LOAD command, if
> > + *           BPF_PROG_LOAD itself is allowed.
> > + *
> > + *           BPF token is created (derived) from an instance of BPF FS,
> > + *           assuming it has necessary delegation mount options specified.
> > + *           BPF FS mount is specified with openat()-style path FD + string.
> > + *           This BPF token can be passed as an extra parameter to various
> > + *           bpf() syscall commands to grant BPF subsystem functionality to
> > + *           unprivileged processes.
> > + *
> > + *           When created, BPF token is "associated" with the owning
> > + *           user namespace of BPF FS instance (super block) that it was
> > + *           derived from, and subsequent BPF operations performed with
> > + *           BPF token would be performing capabilities checks (i.e.,
> > + *           CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN) within
> > + *           that user namespace. Without BPF token, such capabilities
> > + *           have to be granted in init user namespace, making bpf()
> > + *           syscall incompatible with user namespace, for the most part.
> > + *
> > + *   Return
> > + *           A new file descriptor (a nonnegative integer), or -1 if an
> > + *           error occurred (in which case, *errno* is set appropriately).
> > + *
> >   * NOTES
> >   *   eBPF objects (maps and programs) can be shared between processes.
> >   *
> > @@ -901,6 +932,8 @@ enum bpf_cmd {
> >       BPF_ITER_CREATE,
> >       BPF_LINK_DETACH,
> >       BPF_PROG_BIND_MAP,
> > +     BPF_TOKEN_CREATE,
> > +     __MAX_BPF_CMD,
> >  };
> >
> >  enum bpf_map_type {
> > @@ -1694,6 +1727,12 @@ union bpf_attr {
> >               __u32           flags;          /* extra flags */
> >       } prog_bind_map;
> >
> > +     struct { /* struct used by BPF_TOKEN_CREATE command */
> > +             __u32           flags;
> > +             __u32           bpffs_path_fd;
> > +             __u64           bpffs_pathname;
>
> Because bppfs_pathname is a string pointer, so __aligned_u64 is preferred.

ok, I'll use __aligned_u64, even though it can never be unaligned in this case


> > +     } token_create;
> > +
> >  } __attribute__((aligned(8)));
> >
> >  /* The description below is an attempt at providing documentation to eBPF
> > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> > index f526b7573e97..4ce95acfcaa7 100644
> > --- a/kernel/bpf/Makefile
> > +++ b/kernel/bpf/Makefile
> > @@ -6,7 +6,7 @@ cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse
> >  endif
> >  CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
> >
> > -obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o
> > +obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o token.o
> >  obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
> >  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
> >  obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
> > diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> > index 24b3faf901f4..de1fdf396521 100644
> > --- a/kernel/bpf/inode.c
> > +++ b/kernel/bpf/inode.c
> > @@ -99,9 +99,9 @@ static const struct inode_operations bpf_prog_iops = { };
> >  static const struct inode_operations bpf_map_iops  = { };
> >  static const struct inode_operations bpf_link_iops  = { };
> >
> > -static struct inode *bpf_get_inode(struct super_block *sb,
> > -                                const struct inode *dir,
> > -                                umode_t mode)
> > +struct inode *bpf_get_inode(struct super_block *sb,
> > +                         const struct inode *dir,
> > +                         umode_t mode)
> >  {
> >       struct inode *inode;
> >
> > @@ -603,11 +603,13 @@ static int bpf_show_options(struct seq_file *m, struct dentry *root)
> >  {
> >       struct bpf_mount_opts *opts = root->d_sb->s_fs_info;
> >       umode_t mode = d_inode(root)->i_mode & S_IALLUGO & ~S_ISVTX;
> > +     u64 mask;
> >
> >       if (mode != S_IRWXUGO)
> >               seq_printf(m, ",mode=%o", mode);
> >
> > -     if (opts->delegate_cmds == ~0ULL)
> > +     mask = (1ULL << __MAX_BPF_CMD) - 1;
> > +     if ((opts->delegate_cmds & mask) == mask)
> >               seq_printf(m, ",delegate_cmds=any");
>
> Should we add a BUILD_BUG_ON assertion to guarantee __MAX_BPF_CMD is
> less than sizeof(u64) * 8 ?

yep, good idea, will add for CMD and all others

> >       else if (opts->delegate_cmds)
> >               seq_printf(m, ",delegate_cmds=0x%llx", opts->delegate_cmds);
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 7445dad01fb3..b47791a80930 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -5304,6 +5304,20 @@ static int bpf_prog_bind_map(union bpf_attr *attr)
> >       return ret;
> >  }
> >
> > +#define BPF_TOKEN_CREATE_LAST_FIELD token_create.bpffs_pathname
> > +
> > +static int token_create(union bpf_attr *attr)
> > +{
> > +     if (CHECK_ATTR(BPF_TOKEN_CREATE))
> > +             return -EINVAL;
> > +
> > +     /* no flags are supported yet */
> > +     if (attr->token_create.flags)
> > +             return -EINVAL;
> > +
> > +     return bpf_token_create(attr);
> > +}
> > +
> >  static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
> >  {
> >       union bpf_attr attr;
> > @@ -5437,6 +5451,9 @@ static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
> >       case BPF_PROG_BIND_MAP:
> >               err = bpf_prog_bind_map(&attr);
> >               break;
> > +     case BPF_TOKEN_CREATE:
> > +             err = token_create(&attr);
> > +             break;
> >       default:
> >               err = -EINVAL;
> >               break;
> > diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
> > new file mode 100644
> > index 000000000000..779aad5007a3
> > --- /dev/null
> > +++ b/kernel/bpf/token.c
> SNIP
> > +#define BPF_TOKEN_INODE_NAME "bpf-token"
> > +
> > +static const struct inode_operations bpf_token_iops = { };
> > +
> > +static const struct file_operations bpf_token_fops = {
> > +     .release        = bpf_token_release,
> > +     .show_fdinfo    = bpf_token_show_fdinfo,
> > +};
> > +
> > +int bpf_token_create(union bpf_attr *attr)
> > +{
> > +     struct bpf_mount_opts *mnt_opts;
> > +     struct bpf_token *token = NULL;
> > +     struct inode *inode;
> > +     struct file *file;
> > +     struct path path;
> > +     umode_t mode;
> > +     int err, fd;
> > +
> > +     err = user_path_at(attr->token_create.bpffs_path_fd,
> > +                        u64_to_user_ptr(attr->token_create.bpffs_pathname),
> > +                        LOOKUP_FOLLOW | LOOKUP_EMPTY, &path);
> > +     if (err)
> > +             return err;
>
> Need to check the mount is a bpffs mount instead of other filesystem mount.

yep, missed that. Fixed, will check `path.mnt->mnt_sb->s_op != &bpf_super_ops`.

> > +
> > +     if (path.mnt->mnt_root != path.dentry) {
> > +             err = -EINVAL;
> > +             goto out_path;
> > +     }
> > +     err = path_permission(&path, MAY_ACCESS);
> > +     if (err)
> > +             goto out_path;
> > +
> > +     mode = S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask());
> > +     inode = bpf_get_inode(path.mnt->mnt_sb, NULL, mode);
> > +     if (IS_ERR(inode)) {
> > +             err = PTR_ERR(inode);
> > +             goto out_path;
> > +     }
> > +
> > +     inode->i_op = &bpf_token_iops;
> > +     inode->i_fop = &bpf_token_fops;
> > +     clear_nlink(inode); /* make sure it is unlinked */
> > +
> > +     file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
> > +     if (IS_ERR(file)) {
> > +             iput(inode);
> > +             err = PTR_ERR(file);
> > +             goto out_file;
>
> goto out_path ?

eagle eye, fixed, thanks!


> > +     }
> > +
> > +     token = bpf_token_alloc();
> > +     if (!token) {
> > +             err = -ENOMEM;
> > +             goto out_file;
> > +     }
> > +
> > +     /* remember bpffs owning userns for future ns_capable() checks */
> > +     token->userns = get_user_ns(path.dentry->d_sb->s_user_ns);
> > +
> > +     mnt_opts = path.dentry->d_sb->s_fs_info;
> > +     token->allowed_cmds = mnt_opts->delegate_cmds;
> > +
> > +     fd = get_unused_fd_flags(O_CLOEXEC);
> > +     if (fd < 0) {
> > +             err = fd;
> > +             goto out_token;
> > +     }
> > +
> > +     file->private_data = token;
> > +     fd_install(fd, file);
> > +
> > +     path_put(&path);
> > +     return fd;
> > +
> > +out_token:
> > +     bpf_token_free(token);
> > +out_file:
> > +     fput(file);
> > +out_path:
> > +     path_put(&path);
> > +     return err;
> > +}
> > +
> .
>
Andrii Nakryiko Oct. 12, 2023, 12:31 a.m. UTC | #4
On Tue, Oct 10, 2023 at 6:17 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Sep 27, 2023 Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > Add new kind of BPF kernel object, BPF token. BPF token is meant to
> > allow delegating privileged BPF functionality, like loading a BPF
> > program or creating a BPF map, from privileged process to a *trusted*
> > unprivileged process, all while have a good amount of control over which
> > privileged operations could be performed using provided BPF token.
> >
> > This is achieved through mounting BPF FS instance with extra delegation
> > mount options, which determine what operations are delegatable, and also
> > constraining it to the owning user namespace (as mentioned in the
> > previous patch).
> >
> > BPF token itself is just a derivative from BPF FS and can be created
> > through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts
> > a path specification (using the usual fd + string path combo) to a BPF
> > FS mount. Currently, BPF token "inherits" delegated command, map types,
> > prog type, and attach type bit sets from BPF FS as is. In the future,
> > having an BPF token as a separate object with its own FD, we can allow
> > to further restrict BPF token's allowable set of things either at the creation
> > time or after the fact, allowing the process to guard itself further
> > from, e.g., unintentionally trying to load undesired kind of BPF
> > programs. But for now we keep things simple and just copy bit sets as is.
> >
> > When BPF token is created from BPF FS mount, we take reference to the
> > BPF super block's owning user namespace, and then use that namespace for
> > checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
> > capabilities that are normally only checked against init userns (using
> > capable()), but now we check them using ns_capable() instead (if BPF
> > token is provided). See bpf_token_capable() for details.
> >
> > Such setup means that BPF token in itself is not sufficient to grant BPF
> > functionality. User namespaced process has to *also* have necessary
> > combination of capabilities inside that user namespace. So while
> > previously CAP_BPF was useless when granted within user namespace, now
> > it gains a meaning and allows container managers and sys admins to have
> > a flexible control over which processes can and need to use BPF
> > functionality within the user namespace (i.e., container in practice).
> > And BPF FS delegation mount options and derived BPF tokens serve as
> > a per-container "flag" to grant overall ability to use bpf() (plus further
> > restrict on which parts of bpf() syscalls are treated as namespaced).
> >
> > The alternative to creating BPF token object was:
> >   a) not having any extra object and just pasing BPF FS path to each
> >      relevant bpf() command. This seems suboptimal as it's racy (mount
> >      under the same path might change in between checking it and using it
> >      for bpf() command). And also less flexible if we'd like to further
> >      restrict ourselves compared to all the delegated functionality
> >      allowed on BPF FS.
> >   b) use non-bpf() interface, e.g., ioctl(), but otherwise also create
> >      a dedicated FD that would represent a token-like functionality. This
> >      doesn't seem superior to having a proper bpf() command, so
> >      BPF_TOKEN_CREATE was chosen.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  include/linux/bpf.h            |  40 +++++++
> >  include/uapi/linux/bpf.h       |  39 +++++++
> >  kernel/bpf/Makefile            |   2 +-
> >  kernel/bpf/inode.c             |  10 +-
> >  kernel/bpf/syscall.c           |  17 +++
> >  kernel/bpf/token.c             | 197 +++++++++++++++++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h |  39 +++++++
> >  7 files changed, 339 insertions(+), 5 deletions(-)
> >  create mode 100644 kernel/bpf/token.c
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index a5bd40f71fd0..c43131a24579 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1572,6 +1576,13 @@ struct bpf_mount_opts {
> >       u64 delegate_attachs;
> >  };
> >
> > +struct bpf_token {
> > +     struct work_struct work;
> > +     atomic64_t refcnt;
> > +     struct user_namespace *userns;
> > +     u64 allowed_cmds;
>
> We'll also need a 'void *security' field to go along with the BPF token
> allocation/creation/free hooks, see my comments below.  This is similar
> to what we do for other kernel objects.
>

ok, I'm thinking of adding a dedicated patch for all the
security-related stuff and refactoring of existing LSM hook(s).

> > +};
> > +
>
> ...
>
> > diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
> > new file mode 100644
> > index 000000000000..779aad5007a3
> > --- /dev/null
> > +++ b/kernel/bpf/token.c
> > @@ -0,0 +1,197 @@
> > +#include <linux/bpf.h>
> > +#include <linux/vmalloc.h>
> > +#include <linux/anon_inodes.h>
>
> Probably don't need the anon_inode.h include anymore.

yep, dropped

>
> > +#include <linux/fdtable.h>
> > +#include <linux/file.h>
> > +#include <linux/fs.h>
> > +#include <linux/kernel.h>
> > +#include <linux/idr.h>
> > +#include <linux/namei.h>
> > +#include <linux/user_namespace.h>
> > +
> > +bool bpf_token_capable(const struct bpf_token *token, int cap)
> > +{
> > +     /* BPF token allows ns_capable() level of capabilities */
> > +     if (token) {
>
> I think we want a LSM hook here before the token is used in the
> capability check.  The LSM will see the capability check, but it will
> not be able to distinguish it from the process which created the
> delegation token.  This is arguably the purpose of the delegation, but
> with the LSM we want to be able to control who can use the delegated
> privilege.  How about something like this:
>
>   if (security_bpf_token_capable(token, cap))
>      return false;

sounds good, I'll add this hook

btw, I'm thinking of guarding the BPF_TOKEN_CREATE command behind the
ns_capable(CAP_BPF) check, WDYT? This seems appropriate. You can get
BPF token only if you have CAP_BPF **within the userns**, so any
process not granted CAP_BPF within namespace ("container") is
guaranteed to not be able to do anything with BPF token.

>
> > +             if (ns_capable(token->userns, cap))
> > +                     return true;
> > +             if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> > +                     return true;
> > +     }
> > +     /* otherwise fallback to capable() checks */
> > +     return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > +}
> > +
> > +void bpf_token_inc(struct bpf_token *token)
> > +{
> > +     atomic64_inc(&token->refcnt);
> > +}
> > +
> > +static void bpf_token_free(struct bpf_token *token)
> > +{
>
> We should have a LSM hook here to handle freeing the LSM state
> associated with the token.
>
>   security_bpf_token_free(token);
>

yep

> > +     put_user_ns(token->userns);
> > +     kvfree(token);
> > +}
>
> ...
>
> > +static struct bpf_token *bpf_token_alloc(void)
> > +{
> > +     struct bpf_token *token;
> > +
> > +     token = kvzalloc(sizeof(*token), GFP_USER);
> > +     if (!token)
> > +             return NULL;
> > +
> > +     atomic64_set(&token->refcnt, 1);
>
> We should have a LSM hook here to allocate the LSM state associated
> with the token.
>
>   if (security_bpf_token_alloc(token)) {
>     kvfree(token);
>     return NULL;
>   }
>
> > +     return token;
> > +}
>
> ...
>

Would having userns and allowed_* masks filled out by that time inside
the token be useful (seems so if we treat bpf_token_alloc as generic
LSM hook). If yes, I'll add security_bpf_token_alloc() after all that
is filled out, right before we try to get unused fd. WDYT?


> > +int bpf_token_create(union bpf_attr *attr)
> > +{
> > +     struct bpf_mount_opts *mnt_opts;
> > +     struct bpf_token *token = NULL;
> > +     struct inode *inode;
> > +     struct file *file;
> > +     struct path path;
> > +     umode_t mode;
> > +     int err, fd;
> > +
> > +     err = user_path_at(attr->token_create.bpffs_path_fd,
> > +                        u64_to_user_ptr(attr->token_create.bpffs_pathname),
> > +                        LOOKUP_FOLLOW | LOOKUP_EMPTY, &path);
> > +     if (err)
> > +             return err;
> > +
> > +     if (path.mnt->mnt_root != path.dentry) {
> > +             err = -EINVAL;
> > +             goto out_path;
> > +     }
> > +     err = path_permission(&path, MAY_ACCESS);
> > +     if (err)
> > +             goto out_path;
> > +
> > +     mode = S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask());
> > +     inode = bpf_get_inode(path.mnt->mnt_sb, NULL, mode);
> > +     if (IS_ERR(inode)) {
> > +             err = PTR_ERR(inode);
> > +             goto out_path;
> > +     }
> > +
> > +     inode->i_op = &bpf_token_iops;
> > +     inode->i_fop = &bpf_token_fops;
> > +     clear_nlink(inode); /* make sure it is unlinked */
> > +
> > +     file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
> > +     if (IS_ERR(file)) {
> > +             iput(inode);
> > +             err = PTR_ERR(file);
> > +             goto out_file;
> > +     }
> > +
> > +     token = bpf_token_alloc();
> > +     if (!token) {
> > +             err = -ENOMEM;
> > +             goto out_file;
> > +     }
> > +
> > +     /* remember bpffs owning userns for future ns_capable() checks */
> > +     token->userns = get_user_ns(path.dentry->d_sb->s_user_ns);
> > +
> > +     mnt_opts = path.dentry->d_sb->s_fs_info;
> > +     token->allowed_cmds = mnt_opts->delegate_cmds;
>
> I think we would want a LSM hook here, both to control the creation
> of the token and mark it with the security attributes of the creating
> process.  How about something like this:
>
>   err = security_bpf_token_create(token);
>   if (err)
>     goto out_token;

hmm... so you'd like both security_bpf_token_alloc() and
security_bpf_token_create()? They seem almost identical, do we need
two? Or is it that the security_bpf_token_alloc() is supposed to be
only used to create those `void *security` context pieces, while
security_bpf_token_create() is actually going to be used for
enforcement? For my own education, is there some explicit flag or some
other sort of mark between LSM hooks for setting up security vs
enforcement? Or is it mostly based on convention and implicitly
following the split?

>
> > +     fd = get_unused_fd_flags(O_CLOEXEC);
> > +     if (fd < 0) {
> > +             err = fd;
> > +             goto out_token;
> > +     }
> > +
> > +     file->private_data = token;
> > +     fd_install(fd, file);
> > +
> > +     path_put(&path);
> > +     return fd;
> > +
> > +out_token:
> > +     bpf_token_free(token);
> > +out_file:
> > +     fput(file);
> > +out_path:
> > +     path_put(&path);
> > +     return err;
> > +}
>
> ...
>
> > +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
> > +{
> > +     if (!token)
> > +             return false;
> > +
> > +     return token->allowed_cmds & (1ULL << cmd);
>
> Similar to bpf_token_capable(), I believe we want a LSM hook here to
> control who is allowed to use the delegated privilege.
>
>   bool bpf_token_allow_cmd(...)
>   {
>     if (token && (token->allowed_cmds & (1ULL << cmd))
>       return security_bpf_token_cmd(token, cmd);

ok, so I guess I'll have to add all four variants:
security_bpf_token_{cmd,map_type,prog_type,attach_type}, right?



>     return false;
>   }
>
> > +}
>
> --
> paul-moore.com
Andrii Nakryiko Oct. 12, 2023, 9:48 p.m. UTC | #5
On Wed, Oct 11, 2023 at 5:31 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Oct 10, 2023 at 6:17 PM Paul Moore <paul@paul-moore.com> wrote:
> >
> > On Sep 27, 2023 Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > Add new kind of BPF kernel object, BPF token. BPF token is meant to
> > > allow delegating privileged BPF functionality, like loading a BPF
> > > program or creating a BPF map, from privileged process to a *trusted*
> > > unprivileged process, all while have a good amount of control over which
> > > privileged operations could be performed using provided BPF token.
> > >
> > > This is achieved through mounting BPF FS instance with extra delegation
> > > mount options, which determine what operations are delegatable, and also
> > > constraining it to the owning user namespace (as mentioned in the
> > > previous patch).
> > >
> > > BPF token itself is just a derivative from BPF FS and can be created
> > > through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts
> > > a path specification (using the usual fd + string path combo) to a BPF
> > > FS mount. Currently, BPF token "inherits" delegated command, map types,
> > > prog type, and attach type bit sets from BPF FS as is. In the future,
> > > having an BPF token as a separate object with its own FD, we can allow
> > > to further restrict BPF token's allowable set of things either at the creation
> > > time or after the fact, allowing the process to guard itself further
> > > from, e.g., unintentionally trying to load undesired kind of BPF
> > > programs. But for now we keep things simple and just copy bit sets as is.
> > >
> > > When BPF token is created from BPF FS mount, we take reference to the
> > > BPF super block's owning user namespace, and then use that namespace for
> > > checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
> > > capabilities that are normally only checked against init userns (using
> > > capable()), but now we check them using ns_capable() instead (if BPF
> > > token is provided). See bpf_token_capable() for details.
> > >
> > > Such setup means that BPF token in itself is not sufficient to grant BPF
> > > functionality. User namespaced process has to *also* have necessary
> > > combination of capabilities inside that user namespace. So while
> > > previously CAP_BPF was useless when granted within user namespace, now
> > > it gains a meaning and allows container managers and sys admins to have
> > > a flexible control over which processes can and need to use BPF
> > > functionality within the user namespace (i.e., container in practice).
> > > And BPF FS delegation mount options and derived BPF tokens serve as
> > > a per-container "flag" to grant overall ability to use bpf() (plus further
> > > restrict on which parts of bpf() syscalls are treated as namespaced).
> > >
> > > The alternative to creating BPF token object was:
> > >   a) not having any extra object and just pasing BPF FS path to each
> > >      relevant bpf() command. This seems suboptimal as it's racy (mount
> > >      under the same path might change in between checking it and using it
> > >      for bpf() command). And also less flexible if we'd like to further
> > >      restrict ourselves compared to all the delegated functionality
> > >      allowed on BPF FS.
> > >   b) use non-bpf() interface, e.g., ioctl(), but otherwise also create
> > >      a dedicated FD that would represent a token-like functionality. This
> > >      doesn't seem superior to having a proper bpf() command, so
> > >      BPF_TOKEN_CREATE was chosen.
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > >  include/linux/bpf.h            |  40 +++++++
> > >  include/uapi/linux/bpf.h       |  39 +++++++
> > >  kernel/bpf/Makefile            |   2 +-
> > >  kernel/bpf/inode.c             |  10 +-
> > >  kernel/bpf/syscall.c           |  17 +++
> > >  kernel/bpf/token.c             | 197 +++++++++++++++++++++++++++++++++
> > >  tools/include/uapi/linux/bpf.h |  39 +++++++
> > >  7 files changed, 339 insertions(+), 5 deletions(-)
> > >  create mode 100644 kernel/bpf/token.c
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index a5bd40f71fd0..c43131a24579 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1572,6 +1576,13 @@ struct bpf_mount_opts {
> > >       u64 delegate_attachs;
> > >  };
> > >
> > > +struct bpf_token {
> > > +     struct work_struct work;
> > > +     atomic64_t refcnt;
> > > +     struct user_namespace *userns;
> > > +     u64 allowed_cmds;
> >
> > We'll also need a 'void *security' field to go along with the BPF token
> > allocation/creation/free hooks, see my comments below.  This is similar
> > to what we do for other kernel objects.
> >
>
> ok, I'm thinking of adding a dedicated patch for all the
> security-related stuff and refactoring of existing LSM hook(s).
>
> > > +};
> > > +
> >
> > ...
> >
> > > diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
> > > new file mode 100644
> > > index 000000000000..779aad5007a3
> > > --- /dev/null
> > > +++ b/kernel/bpf/token.c
> > > @@ -0,0 +1,197 @@
> > > +#include <linux/bpf.h>
> > > +#include <linux/vmalloc.h>
> > > +#include <linux/anon_inodes.h>
> >
> > Probably don't need the anon_inode.h include anymore.
>
> yep, dropped
>
> >
> > > +#include <linux/fdtable.h>
> > > +#include <linux/file.h>
> > > +#include <linux/fs.h>
> > > +#include <linux/kernel.h>
> > > +#include <linux/idr.h>
> > > +#include <linux/namei.h>
> > > +#include <linux/user_namespace.h>
> > > +
> > > +bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > +{
> > > +     /* BPF token allows ns_capable() level of capabilities */
> > > +     if (token) {
> >
> > I think we want a LSM hook here before the token is used in the
> > capability check.  The LSM will see the capability check, but it will
> > not be able to distinguish it from the process which created the
> > delegation token.  This is arguably the purpose of the delegation, but
> > with the LSM we want to be able to control who can use the delegated
> > privilege.  How about something like this:
> >
> >   if (security_bpf_token_capable(token, cap))
> >      return false;
>
> sounds good, I'll add this hook
>
> btw, I'm thinking of guarding the BPF_TOKEN_CREATE command behind the
> ns_capable(CAP_BPF) check, WDYT? This seems appropriate. You can get
> BPF token only if you have CAP_BPF **within the userns**, so any
> process not granted CAP_BPF within namespace ("container") is
> guaranteed to not be able to do anything with BPF token.
>
> >
> > > +             if (ns_capable(token->userns, cap))
> > > +                     return true;
> > > +             if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> > > +                     return true;
> > > +     }
> > > +     /* otherwise fallback to capable() checks */
> > > +     return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > +}
> > > +
> > > +void bpf_token_inc(struct bpf_token *token)
> > > +{
> > > +     atomic64_inc(&token->refcnt);
> > > +}
> > > +
> > > +static void bpf_token_free(struct bpf_token *token)
> > > +{
> >
> > We should have a LSM hook here to handle freeing the LSM state
> > associated with the token.
> >
> >   security_bpf_token_free(token);
> >
>
> yep
>
> > > +     put_user_ns(token->userns);
> > > +     kvfree(token);
> > > +}
> >
> > ...
> >
> > > +static struct bpf_token *bpf_token_alloc(void)
> > > +{
> > > +     struct bpf_token *token;
> > > +
> > > +     token = kvzalloc(sizeof(*token), GFP_USER);
> > > +     if (!token)
> > > +             return NULL;
> > > +
> > > +     atomic64_set(&token->refcnt, 1);
> >
> > We should have a LSM hook here to allocate the LSM state associated
> > with the token.
> >
> >   if (security_bpf_token_alloc(token)) {
> >     kvfree(token);
> >     return NULL;
> >   }
> >
> > > +     return token;
> > > +}
> >
> > ...
> >
>
> Would having userns and allowed_* masks filled out by that time inside
> the token be useful (seems so if we treat bpf_token_alloc as generic
> LSM hook). If yes, I'll add security_bpf_token_alloc() after all that
> is filled out, right before we try to get unused fd. WDYT?
>
>
> > > +int bpf_token_create(union bpf_attr *attr)
> > > +{
> > > +     struct bpf_mount_opts *mnt_opts;
> > > +     struct bpf_token *token = NULL;
> > > +     struct inode *inode;
> > > +     struct file *file;
> > > +     struct path path;
> > > +     umode_t mode;
> > > +     int err, fd;
> > > +
> > > +     err = user_path_at(attr->token_create.bpffs_path_fd,
> > > +                        u64_to_user_ptr(attr->token_create.bpffs_pathname),
> > > +                        LOOKUP_FOLLOW | LOOKUP_EMPTY, &path);
> > > +     if (err)
> > > +             return err;
> > > +
> > > +     if (path.mnt->mnt_root != path.dentry) {
> > > +             err = -EINVAL;
> > > +             goto out_path;
> > > +     }
> > > +     err = path_permission(&path, MAY_ACCESS);
> > > +     if (err)
> > > +             goto out_path;
> > > +
> > > +     mode = S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask());
> > > +     inode = bpf_get_inode(path.mnt->mnt_sb, NULL, mode);
> > > +     if (IS_ERR(inode)) {
> > > +             err = PTR_ERR(inode);
> > > +             goto out_path;
> > > +     }
> > > +
> > > +     inode->i_op = &bpf_token_iops;
> > > +     inode->i_fop = &bpf_token_fops;
> > > +     clear_nlink(inode); /* make sure it is unlinked */
> > > +
> > > +     file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
> > > +     if (IS_ERR(file)) {
> > > +             iput(inode);
> > > +             err = PTR_ERR(file);
> > > +             goto out_file;
> > > +     }
> > > +
> > > +     token = bpf_token_alloc();
> > > +     if (!token) {
> > > +             err = -ENOMEM;
> > > +             goto out_file;
> > > +     }
> > > +
> > > +     /* remember bpffs owning userns for future ns_capable() checks */
> > > +     token->userns = get_user_ns(path.dentry->d_sb->s_user_ns);
> > > +
> > > +     mnt_opts = path.dentry->d_sb->s_fs_info;
> > > +     token->allowed_cmds = mnt_opts->delegate_cmds;
> >
> > I think we would want a LSM hook here, both to control the creation
> > of the token and mark it with the security attributes of the creating
> > process.  How about something like this:
> >
> >   err = security_bpf_token_create(token);
> >   if (err)
> >     goto out_token;
>
> hmm... so you'd like both security_bpf_token_alloc() and
> security_bpf_token_create()? They seem almost identical, do we need
> two? Or is it that the security_bpf_token_alloc() is supposed to be
> only used to create those `void *security` context pieces, while
> security_bpf_token_create() is actually going to be used for
> enforcement? For my own education, is there some explicit flag or some
> other sort of mark between LSM hooks for setting up security vs
> enforcement? Or is it mostly based on convention and implicitly
> following the split?
>
> >
> > > +     fd = get_unused_fd_flags(O_CLOEXEC);
> > > +     if (fd < 0) {
> > > +             err = fd;
> > > +             goto out_token;
> > > +     }
> > > +
> > > +     file->private_data = token;
> > > +     fd_install(fd, file);
> > > +
> > > +     path_put(&path);
> > > +     return fd;
> > > +
> > > +out_token:
> > > +     bpf_token_free(token);
> > > +out_file:
> > > +     fput(file);
> > > +out_path:
> > > +     path_put(&path);
> > > +     return err;
> > > +}
> >
> > ...
> >
> > > +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
> > > +{
> > > +     if (!token)
> > > +             return false;
> > > +
> > > +     return token->allowed_cmds & (1ULL << cmd);
> >
> > Similar to bpf_token_capable(), I believe we want a LSM hook here to
> > control who is allowed to use the delegated privilege.
> >
> >   bool bpf_token_allow_cmd(...)
> >   {
> >     if (token && (token->allowed_cmds & (1ULL << cmd))
> >       return security_bpf_token_cmd(token, cmd);
>
> ok, so I guess I'll have to add all four variants:
> security_bpf_token_{cmd,map_type,prog_type,attach_type}, right?
>

Thinking a bit more about this, I think this is unnecessary. All these
allow checks to control other BPF commands (BPF map creation, BPF
program load, bpf() syscall command, etc). We have dedicated LSM hooks
for each such operation, most importantly security_bpf_prog_load() and
security_bpf_map_create(). I'm extending both of those to be
token-aware, and struct bpf_token is one of the input arguments, so if
LSM need to override BPF token allow_* checks, they can do in
respective more specialized hooks.

Adding so many token hooks, one for each different allow mask (or any
other sort of "allow something" parameter) seems to be excessive. It
will both add too many super-detailed LSM hooks and will unnecessarily
tie BPF token implementation details to LSM hook implementations, IMO.
I'll send v7 with just security_bpf_token_{create,free}(), please take
a look and let me know if you are still not convinced.

>
>
> >     return false;
> >   }
> >
> > > +}
> >
> > --
> > paul-moore.com
Paul Moore Oct. 12, 2023, 11:18 p.m. UTC | #6
On Wed, Oct 11, 2023 at 8:31 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Tue, Oct 10, 2023 at 6:17 PM Paul Moore <paul@paul-moore.com> wrote:
> > On Sep 27, 2023 Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > Add new kind of BPF kernel object, BPF token. BPF token is meant to
> > > allow delegating privileged BPF functionality, like loading a BPF
> > > program or creating a BPF map, from privileged process to a *trusted*
> > > unprivileged process, all while have a good amount of control over which
> > > privileged operations could be performed using provided BPF token.
> > >
> > > This is achieved through mounting BPF FS instance with extra delegation
> > > mount options, which determine what operations are delegatable, and also
> > > constraining it to the owning user namespace (as mentioned in the
> > > previous patch).
> > >
> > > BPF token itself is just a derivative from BPF FS and can be created
> > > through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts
> > > a path specification (using the usual fd + string path combo) to a BPF
> > > FS mount. Currently, BPF token "inherits" delegated command, map types,
> > > prog type, and attach type bit sets from BPF FS as is. In the future,
> > > having an BPF token as a separate object with its own FD, we can allow
> > > to further restrict BPF token's allowable set of things either at the creation
> > > time or after the fact, allowing the process to guard itself further
> > > from, e.g., unintentionally trying to load undesired kind of BPF
> > > programs. But for now we keep things simple and just copy bit sets as is.
> > >
> > > When BPF token is created from BPF FS mount, we take reference to the
> > > BPF super block's owning user namespace, and then use that namespace for
> > > checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
> > > capabilities that are normally only checked against init userns (using
> > > capable()), but now we check them using ns_capable() instead (if BPF
> > > token is provided). See bpf_token_capable() for details.
> > >
> > > Such setup means that BPF token in itself is not sufficient to grant BPF
> > > functionality. User namespaced process has to *also* have necessary
> > > combination of capabilities inside that user namespace. So while
> > > previously CAP_BPF was useless when granted within user namespace, now
> > > it gains a meaning and allows container managers and sys admins to have
> > > a flexible control over which processes can and need to use BPF
> > > functionality within the user namespace (i.e., container in practice).
> > > And BPF FS delegation mount options and derived BPF tokens serve as
> > > a per-container "flag" to grant overall ability to use bpf() (plus further
> > > restrict on which parts of bpf() syscalls are treated as namespaced).
> > >
> > > The alternative to creating BPF token object was:
> > >   a) not having any extra object and just pasing BPF FS path to each
> > >      relevant bpf() command. This seems suboptimal as it's racy (mount
> > >      under the same path might change in between checking it and using it
> > >      for bpf() command). And also less flexible if we'd like to further
> > >      restrict ourselves compared to all the delegated functionality
> > >      allowed on BPF FS.
> > >   b) use non-bpf() interface, e.g., ioctl(), but otherwise also create
> > >      a dedicated FD that would represent a token-like functionality. This
> > >      doesn't seem superior to having a proper bpf() command, so
> > >      BPF_TOKEN_CREATE was chosen.
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > >  include/linux/bpf.h            |  40 +++++++
> > >  include/uapi/linux/bpf.h       |  39 +++++++
> > >  kernel/bpf/Makefile            |   2 +-
> > >  kernel/bpf/inode.c             |  10 +-
> > >  kernel/bpf/syscall.c           |  17 +++
> > >  kernel/bpf/token.c             | 197 +++++++++++++++++++++++++++++++++
> > >  tools/include/uapi/linux/bpf.h |  39 +++++++
> > >  7 files changed, 339 insertions(+), 5 deletions(-)
> > >  create mode 100644 kernel/bpf/token.c
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index a5bd40f71fd0..c43131a24579 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1572,6 +1576,13 @@ struct bpf_mount_opts {
> > >       u64 delegate_attachs;
> > >  };
> > >
> > > +struct bpf_token {
> > > +     struct work_struct work;
> > > +     atomic64_t refcnt;
> > > +     struct user_namespace *userns;
> > > +     u64 allowed_cmds;
> >
> > We'll also need a 'void *security' field to go along with the BPF token
> > allocation/creation/free hooks, see my comments below.  This is similar
> > to what we do for other kernel objects.
>
> ok, I'm thinking of adding a dedicated patch for all the
> security-related stuff and refactoring of existing LSM hook(s).

No objection here.  My main concern is that we get the LSM stuff in
the same patchset as the rest of the BPF token patches; if all of the
LSM bits are in a separate patch I'm not bothered.

Once we settle on the LSM hooks I'll draft a SELinux implementation of
the hooks which I'll hand off to you for inclusion in the patchset as
well.  I'd encourage the other LSMs that are interested to do the
same.

> > > +bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > +{
> > > +     /* BPF token allows ns_capable() level of capabilities */
> > > +     if (token) {
> >
> > I think we want a LSM hook here before the token is used in the
> > capability check.  The LSM will see the capability check, but it will
> > not be able to distinguish it from the process which created the
> > delegation token.  This is arguably the purpose of the delegation, but
> > with the LSM we want to be able to control who can use the delegated
> > privilege.  How about something like this:
> >
> >   if (security_bpf_token_capable(token, cap))
> >      return false;
>
> sounds good, I'll add this hook
>
> btw, I'm thinking of guarding the BPF_TOKEN_CREATE command behind the
> ns_capable(CAP_BPF) check, WDYT? This seems appropriate. You can get
> BPF token only if you have CAP_BPF **within the userns**, so any
> process not granted CAP_BPF within namespace ("container") is
> guaranteed to not be able to do anything with BPF token.

I don't recall seeing any capability checks guarding BPF_TOKEN_CREATE
in this revision so I don't think you're weakening the restrictions
any, and the logic above seems reasonable: if you don't have CAP_BPF
you shouldn't be creating a token.

> > > +static struct bpf_token *bpf_token_alloc(void)
> > > +{
> > > +     struct bpf_token *token;
> > > +
> > > +     token = kvzalloc(sizeof(*token), GFP_USER);
> > > +     if (!token)
> > > +             return NULL;
> > > +
> > > +     atomic64_set(&token->refcnt, 1);
> >
> > We should have a LSM hook here to allocate the LSM state associated
> > with the token.
> >
> >   if (security_bpf_token_alloc(token)) {
> >     kvfree(token);
> >     return NULL;
> >   }
> >
> > > +     return token;
> > > +}
> >
> > ...
> >
>
> Would having userns and allowed_* masks filled out by that time inside
> the token be useful (seems so if we treat bpf_token_alloc as generic
> LSM hook). If yes, I'll add security_bpf_token_alloc() after all that
> is filled out, right before we try to get unused fd. WDYT?

The security_bpf_token_alloc() hook isn't intended to do any access
control, it's just there so that the LSMs which need to allocate state
for the token object can do so at the same time the token is
allocated.  It has been my experience that allocating and releasing
the LSM state at the same time as the primary object's state is much
less fragile than disconnecting the two lifetimes and allocating the
LSM state later.

> > > +int bpf_token_create(union bpf_attr *attr)
> > > +{
> > > +     struct bpf_mount_opts *mnt_opts;
> > > +     struct bpf_token *token = NULL;
> > > +     struct inode *inode;
> > > +     struct file *file;
> > > +     struct path path;
> > > +     umode_t mode;
> > > +     int err, fd;
> > > +
> > > +     err = user_path_at(attr->token_create.bpffs_path_fd,
> > > +                        u64_to_user_ptr(attr->token_create.bpffs_pathname),
> > > +                        LOOKUP_FOLLOW | LOOKUP_EMPTY, &path);
> > > +     if (err)
> > > +             return err;
> > > +
> > > +     if (path.mnt->mnt_root != path.dentry) {
> > > +             err = -EINVAL;
> > > +             goto out_path;
> > > +     }
> > > +     err = path_permission(&path, MAY_ACCESS);
> > > +     if (err)
> > > +             goto out_path;
> > > +
> > > +     mode = S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask());
> > > +     inode = bpf_get_inode(path.mnt->mnt_sb, NULL, mode);
> > > +     if (IS_ERR(inode)) {
> > > +             err = PTR_ERR(inode);
> > > +             goto out_path;
> > > +     }
> > > +
> > > +     inode->i_op = &bpf_token_iops;
> > > +     inode->i_fop = &bpf_token_fops;
> > > +     clear_nlink(inode); /* make sure it is unlinked */
> > > +
> > > +     file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
> > > +     if (IS_ERR(file)) {
> > > +             iput(inode);
> > > +             err = PTR_ERR(file);
> > > +             goto out_file;
> > > +     }
> > > +
> > > +     token = bpf_token_alloc();
> > > +     if (!token) {
> > > +             err = -ENOMEM;
> > > +             goto out_file;
> > > +     }
> > > +
> > > +     /* remember bpffs owning userns for future ns_capable() checks */
> > > +     token->userns = get_user_ns(path.dentry->d_sb->s_user_ns);
> > > +
> > > +     mnt_opts = path.dentry->d_sb->s_fs_info;
> > > +     token->allowed_cmds = mnt_opts->delegate_cmds;
> >
> > I think we would want a LSM hook here, both to control the creation
> > of the token and mark it with the security attributes of the creating
> > process.  How about something like this:
> >
> >   err = security_bpf_token_create(token);
> >   if (err)
> >     goto out_token;
>
> hmm... so you'd like both security_bpf_token_alloc() and
> security_bpf_token_create()? They seem almost identical, do we need
> two? Or is it that the security_bpf_token_alloc() is supposed to be
> only used to create those `void *security` context pieces, while
> security_bpf_token_create() is actually going to be used for
> enforcement?

I tried to explain a bit in my comment above, but the alloc hook
basically just does the LSM state allocation whereas the token_create
hook does the access control and setting of the LSM state associated
with the token.

If you want to get rid of the bpf_token_alloc() function and fold it
into the bpf_token_create() function then we can go down to one LSM
hook that covers state allocation, initialization, and access control.
However, if it remains possible to allocate a token object outside of
bpf_token_create() I think it is a good idea to keep the dedicated LSM
allocation hooks.

You can apply the same logic to the LSM token state destructor hook.

> For my own education, is there some explicit flag or some
> other sort of mark between LSM hooks for setting up security vs
> enforcement? Or is it mostly based on convention and implicitly
> following the split?

Generally convention based around trying to match the LSM state
lifetime with the lifetime of the associated object; as I mentioned
earlier, we've had problems in the past when the two differ.  If you
look at all of the LSM hooks you'll see a number of
"security_XXXX_alloc()" hooks for things like superblocks, inodes,
files, task structs, creds, and the like; we're just doing the same
things here with BPF tokens.  If you're still not convinced, it may be
worth noting that we currently have security_bpf_map_alloc() and
security_bpf_prog_alloc() hooks.

> > > +     fd = get_unused_fd_flags(O_CLOEXEC);
> > > +     if (fd < 0) {
> > > +             err = fd;
> > > +             goto out_token;
> > > +     }
> > > +
> > > +     file->private_data = token;
> > > +     fd_install(fd, file);
> > > +
> > > +     path_put(&path);
> > > +     return fd;
> > > +
> > > +out_token:
> > > +     bpf_token_free(token);
> > > +out_file:
> > > +     fput(file);
> > > +out_path:
> > > +     path_put(&path);
> > > +     return err;
> > > +}
> >
> > ...
> >
> > > +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
> > > +{
> > > +     if (!token)
> > > +             return false;
> > > +
> > > +     return token->allowed_cmds & (1ULL << cmd);
> >
> > Similar to bpf_token_capable(), I believe we want a LSM hook here to
> > control who is allowed to use the delegated privilege.
> >
> >   bool bpf_token_allow_cmd(...)
> >   {
> >     if (token && (token->allowed_cmds & (1ULL << cmd))
> >       return security_bpf_token_cmd(token, cmd);
>
> ok, so I guess I'll have to add all four variants:
> security_bpf_token_{cmd,map_type,prog_type,attach_type}, right?

Not necessarily.  Currently only SELinux provides a set of LSM BPF
access controls, and it only concerns itself with BPF maps and
programs.  From a map perspective it comes down to controlling which
applications can create a map, or use a map created elsewhere (we
label BPF map objects just as we would any other kernel object).  From
a program perspective, it is about loading programs and executing
them.  While not BPF specific, SELinux also provides controls that
restrict the ability to exercise capabilities.

Keeping the above access controls in mind, I'm hopeful you can better
understand the reasoning behind the hooks I'm proposing.  The
security_bpf_token_cmd() hook is there so that we can control a
token's ability to authorize BPF_MAP_CREATE and BPF_PROG_LOAD.  The
security_bpf_token_capable() hook is there so that we can control a
token's ability to authorize the BPF-related capabilities.  While I
guess it's possible someone might develop a security model for BPF
that would require the other LSM hooks you've mentioned above, but I
see no need for that now, and to be honest I don't see any need on the
visible horizon either.

Does that make sense?

--
paul-moore.com
Paul Moore Oct. 12, 2023, 11:43 p.m. UTC | #7
On Thu, Oct 12, 2023 at 5:48 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Wed, Oct 11, 2023 at 5:31 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > ok, so I guess I'll have to add all four variants:
> > security_bpf_token_{cmd,map_type,prog_type,attach_type}, right?
> >
>
> Thinking a bit more about this, I think this is unnecessary. All these
> allow checks to control other BPF commands (BPF map creation, BPF
> program load, bpf() syscall command, etc). We have dedicated LSM hooks
> for each such operation, most importantly security_bpf_prog_load() and
> security_bpf_map_create(). I'm extending both of those to be
> token-aware, and struct bpf_token is one of the input arguments, so if
> LSM need to override BPF token allow_* checks, they can do in
> respective more specialized hooks.
>
> Adding so many token hooks, one for each different allow mask (or any
> other sort of "allow something" parameter) seems to be excessive. It
> will both add too many super-detailed LSM hooks and will unnecessarily
> tie BPF token implementation details to LSM hook implementations, IMO.
> I'll send v7 with just security_bpf_token_{create,free}(), please take
> a look and let me know if you are still not convinced.

I'm hoping my last email better explains why we only really need
security_bpf_token_cmd() and security_bpf_token_capable() as opposed
to the full list of security_bpf_token_XXX().  If not, please let me
know and I'll try to do a better job explaining my reasoning :)

One thing I didn't discuss in my last email was why there is value in
having both security_bpf_token_{cmd,capable}() as well as
security_bpf_prog_load(); I'll try to do that below.

As we talked about previously, the reason for having
security_bpf_prog_load() is to provide a token-aware version of
security_bpf().  Currently the LSMs enforce their access controls
around BPF commands using the security_bpf() hook which is
unfortunately well before we have access to the BPF token.  If we want
to be able to take the BPF token into account we need to have a hook
placed after the token is retrieved and validated, hence the
security_bpf_prog_load() hook.  In a kernel that provides BPF tokens,
I would expect that LSMs would use security_bpf() to control access to
BPF operations where a token is not a concern, and new token-aware
security_bpf_OPERATION() hooks when the LSM needs to consider the BPF
token.

With the understanding that security_bpf_prog_load() is essentially a
token-aware version of security_bpf(), I'm hopeful that you can begin
to understand that it  serves a different purpose than
security_bpf_token_{cmd,capable}().  The simple answer is that
security_bpf_token_cmd() applies to more than just BPF_PROG_LOAD, but
the better answer is that it has the ability to impact more than just
the LSM authorization decision.  Hooking the LSM into the
bpf_token_allow_cmd() function allows the LSM to authorize the
individual command overrides independent of the command specific LSM
hook, if one exists.  The security_bpf_token_cmd() hook can allow or
disallow the use of a token for all aspects of a specific BPF
operation including all of the token related logic outside of the LSM,
something the security_bpf_prog_load() hook could never do.

I'm hoping that makes sense :)
Andrii Nakryiko Oct. 12, 2023, 11:45 p.m. UTC | #8
On Thu, Oct 12, 2023 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Wed, Oct 11, 2023 at 8:31 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> > On Tue, Oct 10, 2023 at 6:17 PM Paul Moore <paul@paul-moore.com> wrote:
> > > On Sep 27, 2023 Andrii Nakryiko <andrii@kernel.org> wrote:
> > > >
> > > > Add new kind of BPF kernel object, BPF token. BPF token is meant to
> > > > allow delegating privileged BPF functionality, like loading a BPF
> > > > program or creating a BPF map, from privileged process to a *trusted*
> > > > unprivileged process, all while have a good amount of control over which
> > > > privileged operations could be performed using provided BPF token.
> > > >
> > > > This is achieved through mounting BPF FS instance with extra delegation
> > > > mount options, which determine what operations are delegatable, and also
> > > > constraining it to the owning user namespace (as mentioned in the
> > > > previous patch).
> > > >
> > > > BPF token itself is just a derivative from BPF FS and can be created
> > > > through a new bpf() syscall command, BPF_TOKEN_CREAT, which accepts
> > > > a path specification (using the usual fd + string path combo) to a BPF
> > > > FS mount. Currently, BPF token "inherits" delegated command, map types,
> > > > prog type, and attach type bit sets from BPF FS as is. In the future,
> > > > having an BPF token as a separate object with its own FD, we can allow
> > > > to further restrict BPF token's allowable set of things either at the creation
> > > > time or after the fact, allowing the process to guard itself further
> > > > from, e.g., unintentionally trying to load undesired kind of BPF
> > > > programs. But for now we keep things simple and just copy bit sets as is.
> > > >
> > > > When BPF token is created from BPF FS mount, we take reference to the
> > > > BPF super block's owning user namespace, and then use that namespace for
> > > > checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
> > > > capabilities that are normally only checked against init userns (using
> > > > capable()), but now we check them using ns_capable() instead (if BPF
> > > > token is provided). See bpf_token_capable() for details.
> > > >
> > > > Such setup means that BPF token in itself is not sufficient to grant BPF
> > > > functionality. User namespaced process has to *also* have necessary
> > > > combination of capabilities inside that user namespace. So while
> > > > previously CAP_BPF was useless when granted within user namespace, now
> > > > it gains a meaning and allows container managers and sys admins to have
> > > > a flexible control over which processes can and need to use BPF
> > > > functionality within the user namespace (i.e., container in practice).
> > > > And BPF FS delegation mount options and derived BPF tokens serve as
> > > > a per-container "flag" to grant overall ability to use bpf() (plus further
> > > > restrict on which parts of bpf() syscalls are treated as namespaced).
> > > >
> > > > The alternative to creating BPF token object was:
> > > >   a) not having any extra object and just pasing BPF FS path to each
> > > >      relevant bpf() command. This seems suboptimal as it's racy (mount
> > > >      under the same path might change in between checking it and using it
> > > >      for bpf() command). And also less flexible if we'd like to further
> > > >      restrict ourselves compared to all the delegated functionality
> > > >      allowed on BPF FS.
> > > >   b) use non-bpf() interface, e.g., ioctl(), but otherwise also create
> > > >      a dedicated FD that would represent a token-like functionality. This
> > > >      doesn't seem superior to having a proper bpf() command, so
> > > >      BPF_TOKEN_CREATE was chosen.
> > > >
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > ---
> > > >  include/linux/bpf.h            |  40 +++++++
> > > >  include/uapi/linux/bpf.h       |  39 +++++++
> > > >  kernel/bpf/Makefile            |   2 +-
> > > >  kernel/bpf/inode.c             |  10 +-
> > > >  kernel/bpf/syscall.c           |  17 +++
> > > >  kernel/bpf/token.c             | 197 +++++++++++++++++++++++++++++++++
> > > >  tools/include/uapi/linux/bpf.h |  39 +++++++
> > > >  7 files changed, 339 insertions(+), 5 deletions(-)
> > > >  create mode 100644 kernel/bpf/token.c
> > > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index a5bd40f71fd0..c43131a24579 100644
> > > > --- a/include/linux/bpf.h
> > > > +++ b/include/linux/bpf.h
> > > > @@ -1572,6 +1576,13 @@ struct bpf_mount_opts {
> > > >       u64 delegate_attachs;
> > > >  };
> > > >
> > > > +struct bpf_token {
> > > > +     struct work_struct work;
> > > > +     atomic64_t refcnt;
> > > > +     struct user_namespace *userns;
> > > > +     u64 allowed_cmds;
> > >
> > > We'll also need a 'void *security' field to go along with the BPF token
> > > allocation/creation/free hooks, see my comments below.  This is similar
> > > to what we do for other kernel objects.
> >
> > ok, I'm thinking of adding a dedicated patch for all the
> > security-related stuff and refactoring of existing LSM hook(s).
>
> No objection here.  My main concern is that we get the LSM stuff in
> the same patchset as the rest of the BPF token patches; if all of the
> LSM bits are in a separate patch I'm not bothered.
>
> Once we settle on the LSM hooks I'll draft a SELinux implementation of
> the hooks which I'll hand off to you for inclusion in the patchset as
> well.  I'd encourage the other LSMs that are interested to do the
> same.

I posted v7 ([0]) a few minutes ago, please take a look when you get a
chance. All the LSM-related stuff is in a few separate patches.

  [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=792765&state=*

>
> > > > +bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > > +{
> > > > +     /* BPF token allows ns_capable() level of capabilities */
> > > > +     if (token) {
> > >
> > > I think we want a LSM hook here before the token is used in the
> > > capability check.  The LSM will see the capability check, but it will
> > > not be able to distinguish it from the process which created the
> > > delegation token.  This is arguably the purpose of the delegation, but
> > > with the LSM we want to be able to control who can use the delegated
> > > privilege.  How about something like this:
> > >
> > >   if (security_bpf_token_capable(token, cap))
> > >      return false;
> >
> > sounds good, I'll add this hook
> >
> > btw, I'm thinking of guarding the BPF_TOKEN_CREATE command behind the
> > ns_capable(CAP_BPF) check, WDYT? This seems appropriate. You can get
> > BPF token only if you have CAP_BPF **within the userns**, so any
> > process not granted CAP_BPF within namespace ("container") is
> > guaranteed to not be able to do anything with BPF token.
>
> I don't recall seeing any capability checks guarding BPF_TOKEN_CREATE
> in this revision so I don't think you're weakening the restrictions
> any, and the logic above seems reasonable: if you don't have CAP_BPF
> you shouldn't be creating a token.

Right, there was none based on the assumption that you have to have an
access to BPF FS controlled through other means. But it does seem
right to use ns_capable(CAP_BPF) in addition to all that, so that's
what I did in v7.

>
> > > > +static struct bpf_token *bpf_token_alloc(void)
> > > > +{
> > > > +     struct bpf_token *token;
> > > > +
> > > > +     token = kvzalloc(sizeof(*token), GFP_USER);
> > > > +     if (!token)
> > > > +             return NULL;
> > > > +
> > > > +     atomic64_set(&token->refcnt, 1);
> > >
> > > We should have a LSM hook here to allocate the LSM state associated
> > > with the token.
> > >
> > >   if (security_bpf_token_alloc(token)) {
> > >     kvfree(token);
> > >     return NULL;
> > >   }
> > >
> > > > +     return token;
> > > > +}
> > >
> > > ...
> > >
> >
> > Would having userns and allowed_* masks filled out by that time inside
> > the token be useful (seems so if we treat bpf_token_alloc as generic
> > LSM hook). If yes, I'll add security_bpf_token_alloc() after all that
> > is filled out, right before we try to get unused fd. WDYT?
>
> The security_bpf_token_alloc() hook isn't intended to do any access
> control, it's just there so that the LSMs which need to allocate state
> for the token object can do so at the same time the token is
> allocated.  It has been my experience that allocating and releasing
> the LSM state at the same time as the primary object's state is much
> less fragile than disconnecting the two lifetimes and allocating the
> LSM state later.

You'll be able to see what I did in v7, but I don't think separate
lifetimes are a concern, because all these alloc/free hooks are called
just as bpf_{prog,map,token} are created or right when they are freed,
so in that regard everything is good.

But based on our discussion to rework bpf_prog_alloc/bpf_map_alloc
into bpf_prog_load/bpf_map_create, it would be inconsistent between
prog/map and token. Anyways, the code is out, take a look. If you hate
it, it's just a matter of adding one extra LSM hook for token, map,
and prog each.

>
> > > > +int bpf_token_create(union bpf_attr *attr)
> > > > +{
> > > > +     struct bpf_mount_opts *mnt_opts;
> > > > +     struct bpf_token *token = NULL;
> > > > +     struct inode *inode;
> > > > +     struct file *file;
> > > > +     struct path path;
> > > > +     umode_t mode;
> > > > +     int err, fd;
> > > > +
> > > > +     err = user_path_at(attr->token_create.bpffs_path_fd,
> > > > +                        u64_to_user_ptr(attr->token_create.bpffs_pathname),
> > > > +                        LOOKUP_FOLLOW | LOOKUP_EMPTY, &path);
> > > > +     if (err)
> > > > +             return err;
> > > > +
> > > > +     if (path.mnt->mnt_root != path.dentry) {
> > > > +             err = -EINVAL;
> > > > +             goto out_path;
> > > > +     }
> > > > +     err = path_permission(&path, MAY_ACCESS);
> > > > +     if (err)
> > > > +             goto out_path;
> > > > +
> > > > +     mode = S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask());
> > > > +     inode = bpf_get_inode(path.mnt->mnt_sb, NULL, mode);
> > > > +     if (IS_ERR(inode)) {
> > > > +             err = PTR_ERR(inode);
> > > > +             goto out_path;
> > > > +     }
> > > > +
> > > > +     inode->i_op = &bpf_token_iops;
> > > > +     inode->i_fop = &bpf_token_fops;
> > > > +     clear_nlink(inode); /* make sure it is unlinked */
> > > > +
> > > > +     file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
> > > > +     if (IS_ERR(file)) {
> > > > +             iput(inode);
> > > > +             err = PTR_ERR(file);
> > > > +             goto out_file;
> > > > +     }
> > > > +
> > > > +     token = bpf_token_alloc();
> > > > +     if (!token) {
> > > > +             err = -ENOMEM;
> > > > +             goto out_file;
> > > > +     }
> > > > +
> > > > +     /* remember bpffs owning userns for future ns_capable() checks */
> > > > +     token->userns = get_user_ns(path.dentry->d_sb->s_user_ns);
> > > > +
> > > > +     mnt_opts = path.dentry->d_sb->s_fs_info;
> > > > +     token->allowed_cmds = mnt_opts->delegate_cmds;
> > >
> > > I think we would want a LSM hook here, both to control the creation
> > > of the token and mark it with the security attributes of the creating
> > > process.  How about something like this:
> > >
> > >   err = security_bpf_token_create(token);
> > >   if (err)
> > >     goto out_token;
> >
> > hmm... so you'd like both security_bpf_token_alloc() and
> > security_bpf_token_create()? They seem almost identical, do we need
> > two? Or is it that the security_bpf_token_alloc() is supposed to be
> > only used to create those `void *security` context pieces, while
> > security_bpf_token_create() is actually going to be used for
> > enforcement?
>
> I tried to explain a bit in my comment above, but the alloc hook
> basically just does the LSM state allocation whereas the token_create
> hook does the access control and setting of the LSM state associated
> with the token.
>
> If you want to get rid of the bpf_token_alloc() function and fold it
> into the bpf_token_create() function then we can go down to one LSM
> hook that covers state allocation, initialization, and access control.
> However, if it remains possible to allocate a token object outside of
> bpf_token_create() I think it is a good idea to keep the dedicated LSM
> allocation hooks.
>
> You can apply the same logic to the LSM token state destructor hook.

Yes, so that's what I went with: combining allocation and access
control. And no, there is no way to get a prog/map/token without
having a respective alloc/control hook.

I also called out the fact that we might be having a memory leak when
using multiple LSMs together and them disagreeing on alloc hook
return. You'll see in the v7 what I mean, I mention that in the commit
messages.

>
> > For my own education, is there some explicit flag or some
> > other sort of mark between LSM hooks for setting up security vs
> > enforcement? Or is it mostly based on convention and implicitly
> > following the split?
>
> Generally convention based around trying to match the LSM state
> lifetime with the lifetime of the associated object; as I mentioned
> earlier, we've had problems in the past when the two differ.  If you
> look at all of the LSM hooks you'll see a number of
> "security_XXXX_alloc()" hooks for things like superblocks, inodes,
> files, task structs, creds, and the like; we're just doing the same
> things here with BPF tokens.  If you're still not convinced, it may be
> worth noting that we currently have security_bpf_map_alloc() and
> security_bpf_prog_alloc() hooks.

It's not like I'm convinced or not, I'm trying to keep things
consistent without breaking anything or causing a mess, but also
hopefully not add too many unnecessary hooks. Let's continue on
respective patches in v7, I'm curious to hear your comments based on
the code I posted.

>
> > > > +     fd = get_unused_fd_flags(O_CLOEXEC);
> > > > +     if (fd < 0) {
> > > > +             err = fd;
> > > > +             goto out_token;
> > > > +     }
> > > > +
> > > > +     file->private_data = token;
> > > > +     fd_install(fd, file);
> > > > +
> > > > +     path_put(&path);
> > > > +     return fd;
> > > > +
> > > > +out_token:
> > > > +     bpf_token_free(token);
> > > > +out_file:
> > > > +     fput(file);
> > > > +out_path:
> > > > +     path_put(&path);
> > > > +     return err;
> > > > +}
> > >
> > > ...
> > >
> > > > +bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
> > > > +{
> > > > +     if (!token)
> > > > +             return false;
> > > > +
> > > > +     return token->allowed_cmds & (1ULL << cmd);
> > >
> > > Similar to bpf_token_capable(), I believe we want a LSM hook here to
> > > control who is allowed to use the delegated privilege.
> > >
> > >   bool bpf_token_allow_cmd(...)
> > >   {
> > >     if (token && (token->allowed_cmds & (1ULL << cmd))
> > >       return security_bpf_token_cmd(token, cmd);
> >
> > ok, so I guess I'll have to add all four variants:
> > security_bpf_token_{cmd,map_type,prog_type,attach_type}, right?
>
> Not necessarily.  Currently only SELinux provides a set of LSM BPF
> access controls, and it only concerns itself with BPF maps and
> programs.  From a map perspective it comes down to controlling which
> applications can create a map, or use a map created elsewhere (we
> label BPF map objects just as we would any other kernel object).  From
> a program perspective, it is about loading programs and executing
> them.  While not BPF specific, SELinux also provides controls that
> restrict the ability to exercise capabilities.
>
> Keeping the above access controls in mind, I'm hopeful you can better
> understand the reasoning behind the hooks I'm proposing.  The
> security_bpf_token_cmd() hook is there so that we can control a
> token's ability to authorize BPF_MAP_CREATE and BPF_PROG_LOAD.  The
> security_bpf_token_capable() hook is there so that we can control a
> token's ability to authorize the BPF-related capabilities.  While I
> guess it's possible someone might develop a security model for BPF
> that would require the other LSM hooks you've mentioned above, but I
> see no need for that now, and to be honest I don't see any need on the
> visible horizon either.
>
> Does that make sense?

I'll reply to your second email :)

>
> --
> paul-moore.com
Andrii Nakryiko Oct. 12, 2023, 11:51 p.m. UTC | #9
On Thu, Oct 12, 2023 at 4:43 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Thu, Oct 12, 2023 at 5:48 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> > On Wed, Oct 11, 2023 at 5:31 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > ok, so I guess I'll have to add all four variants:
> > > security_bpf_token_{cmd,map_type,prog_type,attach_type}, right?
> > >
> >
> > Thinking a bit more about this, I think this is unnecessary. All these
> > allow checks to control other BPF commands (BPF map creation, BPF
> > program load, bpf() syscall command, etc). We have dedicated LSM hooks
> > for each such operation, most importantly security_bpf_prog_load() and
> > security_bpf_map_create(). I'm extending both of those to be
> > token-aware, and struct bpf_token is one of the input arguments, so if
> > LSM need to override BPF token allow_* checks, they can do in
> > respective more specialized hooks.
> >
> > Adding so many token hooks, one for each different allow mask (or any
> > other sort of "allow something" parameter) seems to be excessive. It
> > will both add too many super-detailed LSM hooks and will unnecessarily
> > tie BPF token implementation details to LSM hook implementations, IMO.
> > I'll send v7 with just security_bpf_token_{create,free}(), please take
> > a look and let me know if you are still not convinced.
>
> I'm hoping my last email better explains why we only really need
> security_bpf_token_cmd() and security_bpf_token_capable() as opposed
> to the full list of security_bpf_token_XXX().  If not, please let me
> know and I'll try to do a better job explaining my reasoning :)
>
> One thing I didn't discuss in my last email was why there is value in
> having both security_bpf_token_{cmd,capable}() as well as
> security_bpf_prog_load(); I'll try to do that below.
>
> As we talked about previously, the reason for having
> security_bpf_prog_load() is to provide a token-aware version of
> security_bpf().  Currently the LSMs enforce their access controls
> around BPF commands using the security_bpf() hook which is
> unfortunately well before we have access to the BPF token.  If we want
> to be able to take the BPF token into account we need to have a hook
> placed after the token is retrieved and validated, hence the
> security_bpf_prog_load() hook.  In a kernel that provides BPF tokens,
> I would expect that LSMs would use security_bpf() to control access to
> BPF operations where a token is not a concern, and new token-aware
> security_bpf_OPERATION() hooks when the LSM needs to consider the BPF
> token.
>
> With the understanding that security_bpf_prog_load() is essentially a
> token-aware version of security_bpf(), I'm hopeful that you can begin
> to understand that it  serves a different purpose than
> security_bpf_token_{cmd,capable}().  The simple answer is that
> security_bpf_token_cmd() applies to more than just BPF_PROG_LOAD, but
> the better answer is that it has the ability to impact more than just
> the LSM authorization decision.  Hooking the LSM into the
> bpf_token_allow_cmd() function allows the LSM to authorize the
> individual command overrides independent of the command specific LSM
> hook, if one exists.  The security_bpf_token_cmd() hook can allow or
> disallow the use of a token for all aspects of a specific BPF
> operation including all of the token related logic outside of the LSM,
> something the security_bpf_prog_load() hook could never do.
>
> I'm hoping that makes sense :)

Yes, I think I understand what you are trying to do, but I need to
clarify something about the bpf_token_allow_cmd() check. It's
meaningless for any command besides BPF_PROG_LOAD, BPF_MAP_CREATE, and
BPF_BTF_LOAD. For any other command you cannot even specify token_fd.
So even if you create a token allowing, say, BPF_MAP_LOOKUP_ELEM, it
has no effect, because BPF_MAP_LOOKUP_ELEM is doing its own checks
based on the provided BPF map FD.

So only if the command is token-aware itself, this allowed_cmd makes
any difference. And in such a case we'll most probably have and/or
want to have an LSM hook for that specific command that accepts struct
bpf_token as an argument. Which is what I did for
security_bpf_prog_load and security_bpf_map_create.

Granted, we don't have any LSM hooks for BPF_BTF_LOAD, mostly because
BTF is just a blob of type info data, and I guess no one bothered to
control the ability to load that. But we can add that easily, if you
think it's important.

So taking everything you said, I still think we don't want a
bpf_token_capable hook, and we'll just be getting targeted LSM hooks
if we need them for some new or existing BPF commands.

Basically, bpf_token's allow_cmd doesn't give you a bypass for
non-token checks we are already doing. So security_bpf() for
everything besides BPF_PROG_LOAD/BPF_MAP_CREATE/BPF_BTF_LOAD is a
completely valid way to restrict everything. You won't miss anything.

>
> --
> paul-moore.com
diff mbox series

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a5bd40f71fd0..c43131a24579 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -51,6 +51,10 @@  struct module;
 struct bpf_func_state;
 struct ftrace_ops;
 struct cgroup;
+struct bpf_token;
+struct user_namespace;
+struct super_block;
+struct inode;
 
 extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
@@ -1572,6 +1576,13 @@  struct bpf_mount_opts {
 	u64 delegate_attachs;
 };
 
+struct bpf_token {
+	struct work_struct work;
+	atomic64_t refcnt;
+	struct user_namespace *userns;
+	u64 allowed_cmds;
+};
+
 struct bpf_struct_ops_value;
 struct btf_member;
 
@@ -2162,6 +2173,8 @@  static inline void bpf_map_dec_elem_count(struct bpf_map *map)
 
 extern int sysctl_unprivileged_bpf_disabled;
 
+bool bpf_token_capable(const struct bpf_token *token, int cap);
+
 static inline bool bpf_allow_ptr_leaks(void)
 {
 	return perfmon_capable();
@@ -2196,8 +2209,17 @@  int bpf_link_new_fd(struct bpf_link *link);
 struct bpf_link *bpf_link_get_from_fd(u32 ufd);
 struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
 
+void bpf_token_inc(struct bpf_token *token);
+void bpf_token_put(struct bpf_token *token);
+int bpf_token_create(union bpf_attr *attr);
+struct bpf_token *bpf_token_get_from_fd(u32 ufd);
+
+bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd);
+
 int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
 int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
+struct inode *bpf_get_inode(struct super_block *sb, const struct inode *dir,
+			    umode_t mode);
 
 #define BPF_ITER_FUNC_PREFIX "bpf_iter_"
 #define DEFINE_BPF_ITER_FUNC(target, args...)			\
@@ -2557,6 +2579,24 @@  static inline int bpf_obj_get_user(const char __user *pathname, int flags)
 	return -EOPNOTSUPP;
 }
 
+static inline bool bpf_token_capable(const struct bpf_token *token, int cap)
+{
+	return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
+}
+
+static inline void bpf_token_inc(struct bpf_token *token)
+{
+}
+
+static inline void bpf_token_put(struct bpf_token *token)
+{
+}
+
+static inline struct bpf_token *bpf_token_get_from_fd(u32 ufd)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
 static inline void __dev_flush(void)
 {
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 70bfa997e896..78692911f4a0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -847,6 +847,37 @@  union bpf_iter_link_info {
  *		Returns zero on success. On error, -1 is returned and *errno*
  *		is set appropriately.
  *
+ * BPF_TOKEN_CREATE
+ *	Description
+ *		Create BPF token with embedded information about what
+ *		BPF-related functionality it allows:
+ *		- a set of allowed bpf() syscall commands;
+ *		- a set of allowed BPF map types to be created with
+ *		BPF_MAP_CREATE command, if BPF_MAP_CREATE itself is allowed;
+ *		- a set of allowed BPF program types and BPF program attach
+ *		types to be loaded with BPF_PROG_LOAD command, if
+ *		BPF_PROG_LOAD itself is allowed.
+ *
+ *		BPF token is created (derived) from an instance of BPF FS,
+ *		assuming it has necessary delegation mount options specified.
+ *		BPF FS mount is specified with openat()-style path FD + string.
+ *		This BPF token can be passed as an extra parameter to various
+ *		bpf() syscall commands to grant BPF subsystem functionality to
+ *		unprivileged processes.
+ *
+ *		When created, BPF token is "associated" with the owning
+ *		user namespace of BPF FS instance (super block) that it was
+ *		derived from, and subsequent BPF operations performed with
+ *		BPF token would be performing capabilities checks (i.e.,
+ *		CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN) within
+ *		that user namespace. Without BPF token, such capabilities
+ *		have to be granted in init user namespace, making bpf()
+ *		syscall incompatible with user namespace, for the most part.
+ *
+ *	Return
+ *		A new file descriptor (a nonnegative integer), or -1 if an
+ *		error occurred (in which case, *errno* is set appropriately).
+ *
  * NOTES
  *	eBPF objects (maps and programs) can be shared between processes.
  *
@@ -901,6 +932,8 @@  enum bpf_cmd {
 	BPF_ITER_CREATE,
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
+	BPF_TOKEN_CREATE,
+	__MAX_BPF_CMD,
 };
 
 enum bpf_map_type {
@@ -1694,6 +1727,12 @@  union bpf_attr {
 		__u32		flags;		/* extra flags */
 	} prog_bind_map;
 
+	struct { /* struct used by BPF_TOKEN_CREATE command */
+		__u32		flags;
+		__u32		bpffs_path_fd;
+		__u64		bpffs_pathname;
+	} token_create;
+
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index f526b7573e97..4ce95acfcaa7 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -6,7 +6,7 @@  cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse
 endif
 CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o token.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 24b3faf901f4..de1fdf396521 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -99,9 +99,9 @@  static const struct inode_operations bpf_prog_iops = { };
 static const struct inode_operations bpf_map_iops  = { };
 static const struct inode_operations bpf_link_iops  = { };
 
-static struct inode *bpf_get_inode(struct super_block *sb,
-				   const struct inode *dir,
-				   umode_t mode)
+struct inode *bpf_get_inode(struct super_block *sb,
+			    const struct inode *dir,
+			    umode_t mode)
 {
 	struct inode *inode;
 
@@ -603,11 +603,13 @@  static int bpf_show_options(struct seq_file *m, struct dentry *root)
 {
 	struct bpf_mount_opts *opts = root->d_sb->s_fs_info;
 	umode_t mode = d_inode(root)->i_mode & S_IALLUGO & ~S_ISVTX;
+	u64 mask;
 
 	if (mode != S_IRWXUGO)
 		seq_printf(m, ",mode=%o", mode);
 
-	if (opts->delegate_cmds == ~0ULL)
+	mask = (1ULL << __MAX_BPF_CMD) - 1;
+	if ((opts->delegate_cmds & mask) == mask)
 		seq_printf(m, ",delegate_cmds=any");
 	else if (opts->delegate_cmds)
 		seq_printf(m, ",delegate_cmds=0x%llx", opts->delegate_cmds);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7445dad01fb3..b47791a80930 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -5304,6 +5304,20 @@  static int bpf_prog_bind_map(union bpf_attr *attr)
 	return ret;
 }
 
+#define BPF_TOKEN_CREATE_LAST_FIELD token_create.bpffs_pathname
+
+static int token_create(union bpf_attr *attr)
+{
+	if (CHECK_ATTR(BPF_TOKEN_CREATE))
+		return -EINVAL;
+
+	/* no flags are supported yet */
+	if (attr->token_create.flags)
+		return -EINVAL;
+
+	return bpf_token_create(attr);
+}
+
 static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
 {
 	union bpf_attr attr;
@@ -5437,6 +5451,9 @@  static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
 	case BPF_PROG_BIND_MAP:
 		err = bpf_prog_bind_map(&attr);
 		break;
+	case BPF_TOKEN_CREATE:
+		err = token_create(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
new file mode 100644
index 000000000000..779aad5007a3
--- /dev/null
+++ b/kernel/bpf/token.c
@@ -0,0 +1,197 @@ 
+#include <linux/bpf.h>
+#include <linux/vmalloc.h>
+#include <linux/anon_inodes.h>
+#include <linux/fdtable.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/kernel.h>
+#include <linux/idr.h>
+#include <linux/namei.h>
+#include <linux/user_namespace.h>
+
+bool bpf_token_capable(const struct bpf_token *token, int cap)
+{
+	/* BPF token allows ns_capable() level of capabilities */
+	if (token) {
+		if (ns_capable(token->userns, cap))
+			return true;
+		if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
+			return true;
+	}
+	/* otherwise fallback to capable() checks */
+	return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
+}
+
+void bpf_token_inc(struct bpf_token *token)
+{
+	atomic64_inc(&token->refcnt);
+}
+
+static void bpf_token_free(struct bpf_token *token)
+{
+	put_user_ns(token->userns);
+	kvfree(token);
+}
+
+static void bpf_token_put_deferred(struct work_struct *work)
+{
+	struct bpf_token *token = container_of(work, struct bpf_token, work);
+
+	bpf_token_free(token);
+}
+
+void bpf_token_put(struct bpf_token *token)
+{
+	if (!token)
+		return;
+
+	if (!atomic64_dec_and_test(&token->refcnt))
+		return;
+
+	INIT_WORK(&token->work, bpf_token_put_deferred);
+	schedule_work(&token->work);
+}
+
+static int bpf_token_release(struct inode *inode, struct file *filp)
+{
+	struct bpf_token *token = filp->private_data;
+
+	bpf_token_put(token);
+	return 0;
+}
+
+static void bpf_token_show_fdinfo(struct seq_file *m, struct file *filp)
+{
+	struct bpf_token *token = filp->private_data;
+	u64 mask;
+
+	mask = (1ULL << __MAX_BPF_CMD) - 1;
+	if ((token->allowed_cmds & mask) == mask)
+		seq_printf(m, "allowed_cmds:\tany\n");
+	else
+		seq_printf(m, "allowed_cmds:\t0x%llx\n", token->allowed_cmds);
+}
+
+static struct bpf_token *bpf_token_alloc(void)
+{
+	struct bpf_token *token;
+
+	token = kvzalloc(sizeof(*token), GFP_USER);
+	if (!token)
+		return NULL;
+
+	atomic64_set(&token->refcnt, 1);
+
+	return token;
+}
+
+#define BPF_TOKEN_INODE_NAME "bpf-token"
+
+static const struct inode_operations bpf_token_iops = { };
+
+static const struct file_operations bpf_token_fops = {
+	.release	= bpf_token_release,
+	.show_fdinfo	= bpf_token_show_fdinfo,
+};
+
+int bpf_token_create(union bpf_attr *attr)
+{
+	struct bpf_mount_opts *mnt_opts;
+	struct bpf_token *token = NULL;
+	struct inode *inode;
+	struct file *file;
+	struct path path;
+	umode_t mode;
+	int err, fd;
+
+	err = user_path_at(attr->token_create.bpffs_path_fd,
+			   u64_to_user_ptr(attr->token_create.bpffs_pathname),
+			   LOOKUP_FOLLOW | LOOKUP_EMPTY, &path);
+	if (err)
+		return err;
+
+	if (path.mnt->mnt_root != path.dentry) {
+		err = -EINVAL;
+		goto out_path;
+	}
+	err = path_permission(&path, MAY_ACCESS);
+	if (err)
+		goto out_path;
+
+	mode = S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask());
+	inode = bpf_get_inode(path.mnt->mnt_sb, NULL, mode);
+	if (IS_ERR(inode)) {
+		err = PTR_ERR(inode);
+		goto out_path;
+	}
+
+	inode->i_op = &bpf_token_iops;
+	inode->i_fop = &bpf_token_fops;
+	clear_nlink(inode); /* make sure it is unlinked */
+
+	file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
+	if (IS_ERR(file)) {
+		iput(inode);
+		err = PTR_ERR(file);
+		goto out_file;
+	}
+
+	token = bpf_token_alloc();
+	if (!token) {
+		err = -ENOMEM;
+		goto out_file;
+	}
+
+	/* remember bpffs owning userns for future ns_capable() checks */
+	token->userns = get_user_ns(path.dentry->d_sb->s_user_ns);
+
+	mnt_opts = path.dentry->d_sb->s_fs_info;
+	token->allowed_cmds = mnt_opts->delegate_cmds;
+
+	fd = get_unused_fd_flags(O_CLOEXEC);
+	if (fd < 0) {
+		err = fd;
+		goto out_token;
+	}
+
+	file->private_data = token;
+	fd_install(fd, file);
+
+	path_put(&path);
+	return fd;
+
+out_token:
+	bpf_token_free(token);
+out_file:
+	fput(file);
+out_path:
+	path_put(&path);
+	return err;
+}
+
+struct bpf_token *bpf_token_get_from_fd(u32 ufd)
+{
+	struct fd f = fdget(ufd);
+	struct bpf_token *token;
+
+	if (!f.file)
+		return ERR_PTR(-EBADF);
+	if (f.file->f_op != &bpf_token_fops) {
+		fdput(f);
+		return ERR_PTR(-EINVAL);
+	}
+
+	token = f.file->private_data;
+	bpf_token_inc(token);
+	fdput(f);
+
+	return token;
+}
+
+bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
+{
+	if (!token)
+		return false;
+
+	return token->allowed_cmds & (1ULL << cmd);
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 70bfa997e896..78692911f4a0 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -847,6 +847,37 @@  union bpf_iter_link_info {
  *		Returns zero on success. On error, -1 is returned and *errno*
  *		is set appropriately.
  *
+ * BPF_TOKEN_CREATE
+ *	Description
+ *		Create BPF token with embedded information about what
+ *		BPF-related functionality it allows:
+ *		- a set of allowed bpf() syscall commands;
+ *		- a set of allowed BPF map types to be created with
+ *		BPF_MAP_CREATE command, if BPF_MAP_CREATE itself is allowed;
+ *		- a set of allowed BPF program types and BPF program attach
+ *		types to be loaded with BPF_PROG_LOAD command, if
+ *		BPF_PROG_LOAD itself is allowed.
+ *
+ *		BPF token is created (derived) from an instance of BPF FS,
+ *		assuming it has necessary delegation mount options specified.
+ *		BPF FS mount is specified with openat()-style path FD + string.
+ *		This BPF token can be passed as an extra parameter to various
+ *		bpf() syscall commands to grant BPF subsystem functionality to
+ *		unprivileged processes.
+ *
+ *		When created, BPF token is "associated" with the owning
+ *		user namespace of BPF FS instance (super block) that it was
+ *		derived from, and subsequent BPF operations performed with
+ *		BPF token would be performing capabilities checks (i.e.,
+ *		CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN) within
+ *		that user namespace. Without BPF token, such capabilities
+ *		have to be granted in init user namespace, making bpf()
+ *		syscall incompatible with user namespace, for the most part.
+ *
+ *	Return
+ *		A new file descriptor (a nonnegative integer), or -1 if an
+ *		error occurred (in which case, *errno* is set appropriately).
+ *
  * NOTES
  *	eBPF objects (maps and programs) can be shared between processes.
  *
@@ -901,6 +932,8 @@  enum bpf_cmd {
 	BPF_ITER_CREATE,
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
+	BPF_TOKEN_CREATE,
+	__MAX_BPF_CMD,
 };
 
 enum bpf_map_type {
@@ -1694,6 +1727,12 @@  union bpf_attr {
 		__u32		flags;		/* extra flags */
 	} prog_bind_map;
 
+	struct { /* struct used by BPF_TOKEN_CREATE command */
+		__u32		flags;
+		__u32		bpffs_path_fd;
+		__u64		bpffs_pathname;
+	} token_create;
+
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF