
[v5,bpf-next,03/13] bpf: introduce BPF token object

Message ID 20230919214800.3803828-4-andrii@kernel.org (mailing list archive)
State New, archived
Series BPF token and BPF FS-based delegation

Commit Message

Andrii Nakryiko Sept. 19, 2023, 9:47 p.m. UTC
Add a new kind of BPF kernel object, the BPF token. A BPF token is meant
to allow delegating privileged BPF functionality, such as loading a BPF
program or creating a BPF map, from a privileged process to a *trusted*
unprivileged process, all while retaining a good amount of control over
which privileged operations can be performed using the provided BPF token.

This is achieved by mounting a BPF FS instance with extra delegation
mount options, which determine which operations are delegatable, and by
constraining that instance to its owning user namespace (as mentioned in
the previous patch).

A BPF token itself is just a derivative of a BPF FS instance and can be
created through a new bpf() syscall command, BPF_TOKEN_CREATE, which
accepts a path specification (the usual FD + string path combo) pointing
to a BPF FS mount. Currently, a BPF token "inherits" the delegated
command, map type, program type, and attach type bit sets from the BPF FS
as is. In the future, because the BPF token is a separate object with its
own FD, we can allow further restricting the token's allowed set of
operations, either at creation time or after the fact, letting a process
guard itself against, e.g., unintentionally loading an undesired kind of
BPF program. But for now we keep things simple and just copy the bit sets
as is.
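
To make the intended usage flow concrete, here is a minimal userspace
sketch. It is an illustration only, not part of the patch: it assumes
headers that carry this patch's UAPI additions, and the helper name
bpf_token_create_fd() is made up. The caller is expected to pass an FD
referring to the BPF FS mount root (e.g., obtained with open() and
O_PATH):

  #include <stdint.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  /* Derive a BPF token FD from an already-opened BPF FS mount FD. */
  static int bpf_token_create_fd(int bpffs_fd)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.token_create.flags = 0;                /* no flags supported yet */
          attr.token_create.bpffs_path_fd = bpffs_fd; /* FD of the BPF FS mount root */
          /* empty pathname; the mount is resolved through the FD itself */
          attr.token_create.bpffs_pathname = (uint64_t)(uintptr_t)"";

          return syscall(__NR_bpf, BPF_TOKEN_CREATE, &attr, sizeof(attr));
  }

The returned FD is the BPF token that later patches in this series allow
passing to other bpf() commands.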

When a BPF token is created from a BPF FS mount, we take a reference to
the BPF super block's owning user namespace, and then use that namespace
for checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
capabilities that are normally checked only against the init userns
(using capable()); when a BPF token is provided, we check them using
ns_capable() instead. See bpf_token_capable() for details.

Such a setup means that a BPF token by itself is not sufficient to grant
BPF functionality. A user-namespaced process *also* has to have the
necessary combination of capabilities inside that user namespace. So
while previously CAP_BPF was useless when granted within a user
namespace, it now gains meaning and gives container managers and
sysadmins flexible control over which processes can (and need to) use BPF
functionality within the user namespace (i.e., a container in practice).
The BPF FS delegation mount options and derived BPF tokens serve as
a per-container "flag" that grants the overall ability to use bpf() (and
further restricts which parts of the bpf() syscall are treated as
namespaced).
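
On the privileged side, the setup could look roughly like the sketch
below. This is an illustration only: the delegate_* mount options come
from the previous patch in this series, the "any" value mirrors what
bpf_show_options() prints and is assumed to also be accepted by the
option parser (a raw hex mask would be the conservative alternative),
and the helper name and mount point are arbitrary.

  #include <sys/mount.h>

  /* Done by the container manager with the container's user and mount
   * namespaces already set up, so that the new BPF FS instance is owned
   * by the container's user namespace.
   */
  static int mount_delegated_bpffs(const char *mnt_point)
  {
          /* delegate_cmds/maps/progs/attachs control which bpf() commands,
           * map types, program types, and attach types a BPF token derived
           * from this mount will allow.
           */
          return mount("bpffs", mnt_point, "bpf", 0,
                       "delegate_cmds=any,delegate_maps=any,"
                       "delegate_progs=any,delegate_attachs=any");
  }

A token derived from such a mount (see the BPF_TOKEN_CREATE sketch above)
then only grants what these options delegated.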

The alternatives to creating a BPF token object were:
  a) not having any extra object and just passing a BPF FS path to each
     relevant bpf() command. This seems suboptimal as it is racy (the
     mount under the same path might change between checking it and using
     it for a bpf() command), and also less flexible if we'd like to
     further restrict ourselves compared to all the delegated
     functionality allowed on the BPF FS.
  b) using a non-bpf() interface, e.g., ioctl(), but otherwise also
     creating a dedicated FD that would represent token-like
     functionality. This doesn't seem superior to having a proper bpf()
     command, so BPF_TOKEN_CREATE was chosen.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf.h            |  42 ++++++++
 include/uapi/linux/bpf.h       |  39 +++++++
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/inode.c             |   4 +-
 kernel/bpf/syscall.c           |  17 +++
 kernel/bpf/token.c             | 189 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  39 +++++++
 7 files changed, 330 insertions(+), 2 deletions(-)
 create mode 100644 kernel/bpf/token.c

Comments

Christian Brauner Sept. 26, 2023, 4:21 p.m. UTC | #1
On Tue, Sep 19, 2023 at 02:47:50PM -0700, Andrii Nakryiko wrote:
> [...]
> +#define BPF_TOKEN_INODE_NAME "bpf-token"
> +
> +/* Alloc anon_inode and FD for prepared token.
> + * Returns fd >= 0 on success; negative error, otherwise.
> + */
> +int bpf_token_new_fd(struct bpf_token *token)
> +{
> +	return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC);

It's unnecessary to use the anonymous inode infrastructure for bpf
tokens. It adds even more moving parts and makes reasoning about it even
harder. Just keep it all in bpffs. IIRC, something like the following
(broken, non-compiling draft) should work:

/* bpf_token_file - get an unlinked file living in bpffs */
struct file *bpf_token_file(...)
{
        inode = bpf_get_inode(bpffs_mnt->mnt_sb, dir, mode);
        inode->i_op = &bpf_token_iop;
        inode->i_fop = &bpf_token_fops;

	// some other stuff you might want or need

        res = alloc_file_pseudo(inode, bpffs_mnt, "bpf-token", O_RDWR, &bpf_token_fops);
}

Now set your private data that you might need, reserve an fd, install
the file into the fdtable and return the fd. You should have an unlinked
bpffs file that serves as your bpf token.
Andrii Nakryiko Sept. 26, 2023, 10:10 p.m. UTC | #2
On Tue, Sep 26, 2023 at 9:21 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, Sep 19, 2023 at 02:47:50PM -0700, Andrii Nakryiko wrote:
> > [...]
> > +#define BPF_TOKEN_INODE_NAME "bpf-token"
> > +
> > +/* Alloc anon_inode and FD for prepared token.
> > + * Returns fd >= 0 on success; negative error, otherwise.
> > + */
> > +int bpf_token_new_fd(struct bpf_token *token)
> > +{
> > +     return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC);
>
> It's unnecessary to use the anonymous inode infrastructure for bpf
> tokens. It adds even more moving parts and makes reasoning about it even
> harder. Just keep it all in bpffs. IIRC, something like the following
> (broken, non-compiling draft) should work:
>
> /* bpf_token_file - get an unlinked file living in bpffs */
> struct file *bpf_token_file(...)
> {
>         inode = bpf_get_inode(bpffs_mnt->mnt_sb, dir, mode);
>         inode->i_op = &bpf_token_iop;
>         inode->i_fop = &bpf_token_fops;
>
>         // some other stuff you might want or need
>
>         res = alloc_file_pseudo(inode, bpffs_mnt, "bpf-token", O_RDWR, &bpf_token_fops);
> }
>
> Now set your private data that you might need, reserve an fd, install
> the file into the fdtable and return the fd. You should have an unlinked
> bpffs file that serves as your bpf token.

Just to make sure I understand. You are saying that instead of having
`struct bpf_token *` and passing that into internal APIs
(bpf_token_capable() and bpf_token_allow_xxx()), I should just pass
around `struct super_block *` representing BPF FS instance? Or `struct
bpf_mount_opts *` maybe? Or 'struct vfsmount *'? (Any preferences
here?). Is that right?

The point is to not have a struct bpf_token that keeps its own
refcount, maintains its own allowed_xxx masks, and keeps a refcnt on
the userns?

Should I worry about refcounting of the super_block? It was a nice
property that I could store bpf_token inside the program for some
future checks that could happen during attach time after the BPF
program is loaded and verified. How do I achieve the same if I need a
super_block around? Are there some get/put-like APIs for super_block
(or vfsmount?) that I can use for that? I'm sorry if these are stupid
questions; I'm just trying to cover all the ground before I reimplement
portions of this patch set again. Thanks for understanding!
Christian Brauner Sept. 27, 2023, 9:52 a.m. UTC | #3
> > [...]
> 
> Just to make sure I understand. You are saying that instead of having
> `struct bpf_token *` and passing that into internal APIs
> (bpf_token_capable() and bpf_token_allow_xxx()), I should just pass
> around `struct super_block *` representing BPF FS instance? Or `struct
> bpf_mount_opts *` maybe? Or 'struct vfsmount *'? (Any preferences
> here?). Is that right?

No, that's not what I meant.

So, what you're doing right now to create a bpf token file descriptor is:

return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC);

which is using the anonymous inode infrastructure. That is an entirely
different filesystem (glossing over details) that is best leveraged for
stuff like kvm fds and other things that don't need or have their own
filesystem implementation.

But you do have your own filesystem implementation, so why abuse another
one to create bpf token fds when they can just be created directly from
the bpffs instance?

IOW, everything stays the same apart from the fact that bpf token fds
are actually file descriptors referring to a detached bpffs file instead
of an anonymous inode file. IOW, bpf tokens are actual bpffs objects
tied to a bpffs instance.

**BROKEN BROKEN BROKEN AND UGLY**

int bpf_token_create(union bpf_attr *attr)
{
        struct inode *inode;
        struct path path;
        struct bpf_mount_opts *mnt_opts;
        struct bpf_token *token;
        struct fd fd;
        int fd, ret;
        struct file *file;

        fd = fdget(attr->token_create.bpffs_path_fd);
        if (!fd.file)
                goto cleanup;

        if (fd.file->f_path->dentry != fd.file->f_path->dentry->d_sb->s_root)
                goto cleanup;

        inode = bpf_get_inode(fd.file->f_path->mnt->mnt_sb, NULL, 1234123412341234);
        if (!inode)
                goto cleanup;

        fd = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
        if (fd < 0)
                goto cleanup;

        clear_nlink(inode); /* make sure it is unlinked */

        file = alloc_file_pseudo(inode, fd.file->f_path->mnt, "bpf-token", O_RDWR, &&bpf_token_fops);
        if (IS_ERR(file))
                goto cleanup;

        token = bpf_token_alloc();
        if (!token)
                goto cleanup;

        /* remember bpffs owning userns for future ns_capable() checks */
        token->userns = get_user_ns(path.dentry->d_sb->s_user_ns);

        mnt_opts = path.dentry->d_sb->s_fs_info;
        token->allowed_cmds = mnt_opts->delegate_cmds;
        token->allowed_maps = mnt_opts->delegate_maps;
        token->allowed_progs = mnt_opts->delegate_progs;
        token->allowed_attachs = mnt_opts->delegate_attachs;

        file->private_data = token;
        fd_install(fd, file);
        return fd;

cleanup:
        // cleanup stuff here
        return -SOME_ERROR;
}
Andrii Nakryiko Sept. 27, 2023, 4:03 p.m. UTC | #4
On Wed, Sep 27, 2023 at 2:52 AM Christian Brauner <brauner@kernel.org> wrote:
>
> > > [...]
> >
> > Just to make sure I understand. You are saying that instead of having
> > `struct bpf_token *` and passing that into internal APIs
> > (bpf_token_capable() and bpf_token_allow_xxx()), I should just pass
> > around `struct super_block *` representing BPF FS instance? Or `struct
> > bpf_mount_opts *` maybe? Or 'struct vfsmount *'? (Any preferences
> > here?). Is that right?
>
> No, that's not what I meant.
>
> So, what you're doing right now to create a bpf token file descriptor is:
>
> return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC);
>
> which is using the anonymous inode infrastructure. That is an entirely
> different filesystem (glossing over details) that is best leveraged for
> stuff like kvm fds and other things that don't need or have their own
> filesystem implementation.
>
> But you do have your own filesystem implementation, so why abuse another
> one to create bpf token fds when they can just be created directly from
> the bpffs instance?
>
> IOW, everything stays the same apart from the fact that bpf token fds
> are actually file descriptors referring to a detached bpffs file instead
> of an anonymous inode file. IOW, bpf tokens are actual bpffs objects
> tied to a bpffs instance.

Ah, ok, this is a much smaller change than what I was about to make.
I'm glad I asked and thanks for elaborating! I'll use
alloc_file_pseudo() using bpffs mount in the next revision.

> [...]
Paul Moore Oct. 9, 2023, 10:53 p.m. UTC | #5
On Wed, Sep 27, 2023 at 12:03 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Wed, Sep 27, 2023 at 2:52 AM Christian Brauner <brauner@kernel.org> wrote:

...

> > IOW, everything stays the same apart from the fact that bpf token fds
> > are actually file descriptors referring to a detached bpffs file instead
> > of an anonymous inode file. IOW, bpf tokens are actual bpffs objects
> > tied to a bpffs instance.
>
> Ah, ok, this is a much smaller change than what I was about to make.
> I'm glad I asked and thanks for elaborating! I'll use
> alloc_file_pseudo() using bpffs mount in the next revision.

Just an FYI, I'm still looking at v6 now, but moving from an anon_inode
to a bpffs inode may mean we need to drop an LSM hook into
bpf_token_create() to help mark the inode as a BPF token.  Not a big
deal either way, and I think it makes sense to use a bpffs inode as
opposed to an anonymous inode, just wanted to let you know.
Andrii Nakryiko Oct. 10, 2023, 12:23 a.m. UTC | #6
On Mon, Oct 9, 2023 at 3:53 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Wed, Sep 27, 2023 at 12:03 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> > On Wed, Sep 27, 2023 at 2:52 AM Christian Brauner <brauner@kernel.org> wrote:
>
> ...
>
> > > IOW, everything stays the same apart from the fact that bpf token fds
> > > are actually file descriptors referring to a detached bpffs file instead
> > > of an anonymous inode file. IOW, bpf tokens are actual bpffs objects
> > > tied to a bpffs instance.
> >
> > Ah, ok, this is a much smaller change than what I was about to make.
> > I'm glad I asked and thanks for elaborating! I'll use
> > alloc_file_pseudo() using bpffs mount in the next revision.
>
> Just an FYI, I'm still looking at v6 now, but moving from an anon_inode
> to a bpffs inode may mean we need to drop an LSM hook into
> bpf_token_create() to help mark the inode as a BPF token.  Not a big
> deal either way, and I think it makes sense to use a bpffs inode as
> opposed to an anonymous inode, just wanted to let you know.

Thanks for the heads up. I was about to post a new revision rebased on
the latest bpf-next and with an unrelated selftest fix, but I'll give
it a bit more time to get your feedback and incorporate it into the
next revision. Thanks!

>
> --
> paul-moore.com

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 026923a60cad..ae13538f5465 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -51,6 +51,8 @@  struct module;
 struct bpf_func_state;
 struct ftrace_ops;
 struct cgroup;
+struct bpf_token;
+struct user_namespace;
 
 extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
@@ -1572,6 +1574,13 @@  struct bpf_mount_opts {
 	u64 delegate_attachs;
 };
 
+struct bpf_token {
+	struct work_struct work;
+	atomic64_t refcnt;
+	struct user_namespace *userns;
+	u64 allowed_cmds;
+};
+
 struct bpf_struct_ops_value;
 struct btf_member;
 
@@ -2162,6 +2171,8 @@  static inline void bpf_map_dec_elem_count(struct bpf_map *map)
 
 extern int sysctl_unprivileged_bpf_disabled;
 
+bool bpf_token_capable(const struct bpf_token *token, int cap);
+
 static inline bool bpf_allow_ptr_leaks(void)
 {
 	return perfmon_capable();
@@ -2196,6 +2207,14 @@  int bpf_link_new_fd(struct bpf_link *link);
 struct bpf_link *bpf_link_get_from_fd(u32 ufd);
 struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
 
+void bpf_token_inc(struct bpf_token *token);
+void bpf_token_put(struct bpf_token *token);
+int bpf_token_create(union bpf_attr *attr);
+int bpf_token_new_fd(struct bpf_token *token);
+struct bpf_token *bpf_token_get_from_fd(u32 ufd);
+
+bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd);
+
 int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
 int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
 
@@ -2557,6 +2576,29 @@  static inline int bpf_obj_get_user(const char __user *pathname, int flags)
 	return -EOPNOTSUPP;
 }
 
+static inline bool bpf_token_capable(const struct bpf_token *token, int cap)
+{
+	return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
+}
+
+static inline void bpf_token_inc(struct bpf_token *token)
+{
+}
+
+static inline void bpf_token_put(struct bpf_token *token)
+{
+}
+
+static inline int bpf_token_new_fd(struct bpf_token *token)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline struct bpf_token *bpf_token_get_from_fd(u32 ufd)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
 static inline void __dev_flush(void)
 {
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 73b155e52204..36e98c6f8944 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -847,6 +847,37 @@  union bpf_iter_link_info {
  *		Returns zero on success. On error, -1 is returned and *errno*
  *		is set appropriately.
  *
+ * BPF_TOKEN_CREATE
+ *	Description
+ *		Create BPF token with embedded information about what
+ *		BPF-related functionality it allows:
+ *		- a set of allowed bpf() syscall commands;
+ *		- a set of allowed BPF map types to be created with
+ *		BPF_MAP_CREATE command, if BPF_MAP_CREATE itself is allowed;
+ *		- a set of allowed BPF program types and BPF program attach
+ *		types to be loaded with BPF_PROG_LOAD command, if
+ *		BPF_PROG_LOAD itself is allowed.
+ *
+ *		BPF token is created (derived) from an instance of BPF FS,
+ *		assuming it has necessary delegation mount options specified.
+ *		BPF FS mount is specified with openat()-style path FD + string.
+ *		This BPF token can be passed as an extra parameter to various
+ *		bpf() syscall commands to grant BPF subsystem functionality to
+ *		unprivileged processes.
+ *
+ *		When created, BPF token is "associated" with the owning
+ *		user namespace of BPF FS instance (super block) that it was
+ *		derived from, and subsequent BPF operations performed with
+ *		BPF token would be performing capabilities checks (i.e.,
+ *		CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN) within
+ *		that user namespace. Without BPF token, such capabilities
+ *		have to be granted in init user namespace, making bpf()
+ *		syscall incompatible with user namespace, for the most part.
+ *
+ *	Return
+ *		A new file descriptor (a nonnegative integer), or -1 if an
+ *		error occurred (in which case, *errno* is set appropriately).
+ *
  * NOTES
  *	eBPF objects (maps and programs) can be shared between processes.
  *
@@ -901,6 +932,8 @@  enum bpf_cmd {
 	BPF_ITER_CREATE,
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
+	BPF_TOKEN_CREATE,
+	__MAX_BPF_CMD,
 };
 
 enum bpf_map_type {
@@ -1694,6 +1727,12 @@  union bpf_attr {
 		__u32		flags;		/* extra flags */
 	} prog_bind_map;
 
+	struct { /* struct used by BPF_TOKEN_CREATE command */
+		__u32		flags;
+		__u32		bpffs_path_fd;
+		__u64		bpffs_pathname;
+	} token_create;
+
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index f526b7573e97..4ce95acfcaa7 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -6,7 +6,7 @@  cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse
 endif
 CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o token.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 8f66b57d3546..82f11fbffd3e 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -603,11 +603,13 @@  static int bpf_show_options(struct seq_file *m, struct dentry *root)
 {
 	struct bpf_mount_opts *opts = root->d_sb->s_fs_info;
 	umode_t mode = d_inode(root)->i_mode & S_IALLUGO & ~S_ISVTX;
+	u64 mask;
 
 	if (mode != S_IRWXUGO)
 		seq_printf(m, ",mode=%o", mode);
 
-	if (opts->delegate_cmds == ~0ULL)
+	mask = (1ULL << __MAX_BPF_CMD) - 1;
+	if ((opts->delegate_cmds & mask) == mask)
 		seq_printf(m, ",delegate_cmds=any");
 	else if (opts->delegate_cmds)
 		seq_printf(m, ",delegate_cmds=0x%llx", opts->delegate_cmds);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f024caee0bba..93338faa43d5 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -5302,6 +5302,20 @@  static int bpf_prog_bind_map(union bpf_attr *attr)
 	return ret;
 }
 
+#define BPF_TOKEN_CREATE_LAST_FIELD token_create.bpffs_pathname
+
+static int token_create(union bpf_attr *attr)
+{
+	if (CHECK_ATTR(BPF_TOKEN_CREATE))
+		return -EINVAL;
+
+	/* no flags are supported yet */
+	if (attr->token_create.flags)
+		return -EINVAL;
+
+	return bpf_token_create(attr);
+}
+
 static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
 {
 	union bpf_attr attr;
@@ -5435,6 +5449,9 @@  static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
 	case BPF_PROG_BIND_MAP:
 		err = bpf_prog_bind_map(&attr);
 		break;
+	case BPF_TOKEN_CREATE:
+		err = token_create(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
new file mode 100644
index 000000000000..f6ea3eddbee6
--- /dev/null
+++ b/kernel/bpf/token.c
@@ -0,0 +1,189 @@ 
+#include <linux/bpf.h>
+#include <linux/vmalloc.h>
+#include <linux/anon_inodes.h>
+#include <linux/fdtable.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/kernel.h>
+#include <linux/idr.h>
+#include <linux/namei.h>
+#include <linux/user_namespace.h>
+
+bool bpf_token_capable(const struct bpf_token *token, int cap)
+{
+	/* BPF token allows ns_capable() level of capabilities */
+	if (token) {
+		if (ns_capable(token->userns, cap))
+			return true;
+		if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
+			return true;
+	}
+	/* otherwise fallback to capable() checks */
+	return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
+}
+
+void bpf_token_inc(struct bpf_token *token)
+{
+	atomic64_inc(&token->refcnt);
+}
+
+static void bpf_token_free(struct bpf_token *token)
+{
+	put_user_ns(token->userns);
+	kvfree(token);
+}
+
+static void bpf_token_put_deferred(struct work_struct *work)
+{
+	struct bpf_token *token = container_of(work, struct bpf_token, work);
+
+	bpf_token_free(token);
+}
+
+void bpf_token_put(struct bpf_token *token)
+{
+	if (!token)
+		return;
+
+	if (!atomic64_dec_and_test(&token->refcnt))
+		return;
+
+	INIT_WORK(&token->work, bpf_token_put_deferred);
+	schedule_work(&token->work);
+}
+
+static int bpf_token_release(struct inode *inode, struct file *filp)
+{
+	struct bpf_token *token = filp->private_data;
+
+	bpf_token_put(token);
+	return 0;
+}
+
+static ssize_t bpf_dummy_read(struct file *filp, char __user *buf, size_t siz,
+			      loff_t *ppos)
+{
+	/* We need this handler such that alloc_file() enables
+	 * f_mode with FMODE_CAN_READ.
+	 */
+	return -EINVAL;
+}
+
+static ssize_t bpf_dummy_write(struct file *filp, const char __user *buf,
+			       size_t siz, loff_t *ppos)
+{
+	/* We need this handler such that alloc_file() enables
+	 * f_mode with FMODE_CAN_WRITE.
+	 */
+	return -EINVAL;
+}
+
+static void bpf_token_show_fdinfo(struct seq_file *m, struct file *filp)
+{
+	struct bpf_token *token = filp->private_data;
+	u64 mask;
+
+	mask = (1ULL << __MAX_BPF_CMD) - 1;
+	if ((token->allowed_cmds & mask) == mask)
+		seq_printf(m, "allowed_cmds:\tany\n");
+	else
+		seq_printf(m, "allowed_cmds:\t0x%llx\n", token->allowed_cmds);
+}
+
+static const struct file_operations bpf_token_fops = {
+	.release	= bpf_token_release,
+	.read		= bpf_dummy_read,
+	.write		= bpf_dummy_write,
+	.show_fdinfo	= bpf_token_show_fdinfo,
+};
+
+static struct bpf_token *bpf_token_alloc(void)
+{
+	struct bpf_token *token;
+
+	token = kvzalloc(sizeof(*token), GFP_USER);
+	if (!token)
+		return NULL;
+
+	atomic64_set(&token->refcnt, 1);
+
+	return token;
+}
+
+int bpf_token_create(union bpf_attr *attr)
+{
+	struct path path;
+	struct bpf_mount_opts *mnt_opts;
+	struct bpf_token *token;
+	int ret;
+
+	ret = user_path_at(attr->token_create.bpffs_path_fd,
+			   u64_to_user_ptr(attr->token_create.bpffs_pathname),
+			   LOOKUP_FOLLOW | LOOKUP_EMPTY, &path);
+	if (ret)
+		return ret;
+
+	if (path.mnt->mnt_root != path.dentry) {
+		ret = -EINVAL;
+		goto out;
+	}
+	ret = path_permission(&path, MAY_ACCESS);
+	if (ret)
+		goto out;
+
+	token = bpf_token_alloc();
+	if (!token) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* remember bpffs owning userns for future ns_capable() checks */
+	token->userns = get_user_ns(path.dentry->d_sb->s_user_ns);
+
+	mnt_opts = path.dentry->d_sb->s_fs_info;
+	token->allowed_cmds = mnt_opts->delegate_cmds;
+
+	ret = bpf_token_new_fd(token);
+	if (ret < 0)
+		bpf_token_free(token);
+out:
+	path_put(&path);
+	return ret;
+}
+
+#define BPF_TOKEN_INODE_NAME "bpf-token"
+
+/* Allocate an anon_inode and FD for a prepared token.
+ * Returns fd >= 0 on success; a negative error otherwise.
+ */
+int bpf_token_new_fd(struct bpf_token *token)
+{
+	return anon_inode_getfd(BPF_TOKEN_INODE_NAME, &bpf_token_fops, token, O_CLOEXEC);
+}
+
+struct bpf_token *bpf_token_get_from_fd(u32 ufd)
+{
+	struct fd f = fdget(ufd);
+	struct bpf_token *token;
+
+	if (!f.file)
+		return ERR_PTR(-EBADF);
+	if (f.file->f_op != &bpf_token_fops) {
+		fdput(f);
+		return ERR_PTR(-EINVAL);
+	}
+
+	token = f.file->private_data;
+	bpf_token_inc(token);
+	fdput(f);
+
+	return token;
+}
+
+bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
+{
+	if (!token)
+		return false;
+
+	return token->allowed_cmds & (1ULL << cmd);
+}
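
To show how the helpers above are intended to be consumed, here is a
hypothetical kernel-side sketch; the token_fd attribute and the BPF_MAP_CREATE
wiring are assumptions (this patch does not hook the token into any command
yet), and only bpf_token_get_from_fd(), bpf_token_allow_cmd(),
bpf_token_capable() and bpf_token_put() come from token.c above:

#include <linux/bpf.h>
#include <linux/capability.h>
#include <linux/err.h>

/* hypothetical caller: decide whether map creation is allowed, given an
 * optional token FD passed in by userspace
 */
static int check_map_create_allowed(u32 token_fd)
{
	struct bpf_token *token = NULL;
	bool allowed;

	if (token_fd) {
		token = bpf_token_get_from_fd(token_fd);
		if (IS_ERR(token))
			return PTR_ERR(token);
		/* the token only helps if BPF_MAP_CREATE was delegated */
		if (!bpf_token_allow_cmd(token, BPF_MAP_CREATE)) {
			bpf_token_put(token);
			token = NULL;
		}
	}

	/* ns_capable() against the token's userns, or plain capable() fallback */
	allowed = bpf_token_capable(token, CAP_BPF);
	bpf_token_put(token);

	return allowed ? 0 : -EPERM;
}
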
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 73b155e52204..36e98c6f8944 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -847,6 +847,37 @@  union bpf_iter_link_info {
  *		Returns zero on success. On error, -1 is returned and *errno*
  *		is set appropriately.
  *
+ * BPF_TOKEN_CREATE
+ *	Description
+ *		Create BPF token with embedded information about what
+ *		BPF-related functionality it allows:
+ *		- a set of allowed bpf() syscall commands;
+ *		- a set of allowed BPF map types to be created with
+ *		BPF_MAP_CREATE command, if BPF_MAP_CREATE itself is allowed;
+ *		- a set of allowed BPF program types and BPF program attach
+ *		types to be loaded with BPF_PROG_LOAD command, if
+ *		BPF_PROG_LOAD itself is allowed.
+ *
+ *		BPF token is created (derived) from an instance of BPF FS,
+ *		assuming it has the necessary delegation mount options
+ *		specified. The BPF FS mount is identified by an openat()-style
+ *		path FD + pathname string pair. The resulting BPF token can be
+ *		passed as an extra parameter to various bpf() syscall commands
+ *		to grant BPF subsystem functionality to unprivileged processes.
+ *
+ *		When created, BPF token is "associated" with the owning
+ *		user namespace of the BPF FS instance (super block) it was
+ *		derived from, and subsequent BPF operations performed with
+ *		that token do their capability checks (i.e., CAP_BPF,
+ *		CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN) within that user
+ *		namespace. Without a BPF token, such capabilities have to be
+ *		granted in the init user namespace, which makes the bpf()
+ *		syscall largely unusable from within a user namespace.
+ *
+ *	Return
+ *		A new file descriptor (a nonnegative integer), or -1 if an
+ *		error occurred (in which case, *errno* is set appropriately).
+ *
  * NOTES
  *	eBPF objects (maps and programs) can be shared between processes.
  *
@@ -901,6 +932,8 @@  enum bpf_cmd {
 	BPF_ITER_CREATE,
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
+	BPF_TOKEN_CREATE,
+	__MAX_BPF_CMD,
 };
 
 enum bpf_map_type {
@@ -1694,6 +1727,12 @@  union bpf_attr {
 		__u32		flags;		/* extra flags */
 	} prog_bind_map;
 
+	struct { /* struct used by BPF_TOKEN_CREATE command */
+		__u32		flags;
+		__u32		bpffs_path_fd;
+		__u64		bpffs_pathname;
+	} token_create;
+
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
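
Finally, a small userspace sketch (again, not part of this patch) that inspects
what a token FD delegates by reading the fdinfo text emitted by
bpf_token_show_fdinfo() above ("allowed_cmds:" followed by "any" or a hex mask);
the token FD is assumed to have been created earlier in the same process via
BPF_TOKEN_CREATE, or inherited from a parent:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void print_token_fdinfo(int token_fd)
{
	char path[64], line[256];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", token_fd);
	f = fopen(path, "r");
	if (!f)
		return;

	/* bpf_token_show_fdinfo() emits "allowed_cmds:\tany" or a hex mask */
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "allowed_cmds:", 13))
			fputs(line, stdout);
	}
	fclose(f);
}

int main(int argc, char **argv)
{
	/* the FD number must be valid in this process for fdinfo to exist */
	if (argc == 2)
		print_token_fdinfo(atoi(argv[1]));
	return 0;
}
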