diff mbox series

[v9,bpf-next,02/17] bpf: add BPF token delegation mount options to BPF FS

Message ID 20231103190523.6353-3-andrii@kernel.org (mailing list archive)
State New, archived
Headers show
Series BPF token and BPF FS-based delegation | expand

Commit Message

Andrii Nakryiko Nov. 3, 2023, 7:05 p.m. UTC
Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
  - `delegate_cmds` allow to specify which bpf() syscall commands are
    allowed with BPF token derived from this BPF FS instance;
  - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
    a set of allowable BPF map types that could be created with BPF token;
  - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
    a set of allowable BPF program types that could be loaded with BPF token;
  - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
    a set of allowable BPF program attach types that could be loaded with
    BPF token; delegate_progs and delegate_attachs are meant to be used
    together, as full BPF program type is, in general, determined
    through both program type and program attach type.

Currently, these mount options accept the following forms of values:
  - a special value "any", that enables all possible values of a given
  bit set;
  - numeric value (decimal or hexadecimal, determined by kernel
  automatically) that specifies a bit mask value directly;
  - all the values for a given mount option are combined, if specified
  multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
  delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
  mask.

Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.

Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.

This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container".

There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.

All this is based on suggestions and discussions with Christian Brauner
([0]), to the best of my ability to follow all the implications.

  [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf.h | 10 ++++++
 kernel/bpf/inode.c  | 88 +++++++++++++++++++++++++++++++++++++++------
 2 files changed, 88 insertions(+), 10 deletions(-)

Comments

Christian Brauner Nov. 8, 2023, 1:51 p.m. UTC | #1
On Fri, Nov 03, 2023 at 12:05:08PM -0700, Andrii Nakryiko wrote:
> Add few new mount options to BPF FS that allow to specify that a given
> BPF FS instance allows creation of BPF token (added in the next patch),
> and what sort of operations are allowed under BPF token. As such, we get
> 4 new mount options, each is a bit mask
>   - `delegate_cmds` allow to specify which bpf() syscall commands are
>     allowed with BPF token derived from this BPF FS instance;
>   - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
>     a set of allowable BPF map types that could be created with BPF token;
>   - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
>     a set of allowable BPF program types that could be loaded with BPF token;
>   - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
>     a set of allowable BPF program attach types that could be loaded with
>     BPF token; delegate_progs and delegate_attachs are meant to be used
>     together, as full BPF program type is, in general, determined
>     through both program type and program attach type.
> 
> Currently, these mount options accept the following forms of values:
>   - a special value "any", that enables all possible values of a given
>   bit set;
>   - numeric value (decimal or hexadecimal, determined by kernel
>   automatically) that specifies a bit mask value directly;
>   - all the values for a given mount option are combined, if specified
>   multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
>   delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
>   mask.
> 
> Ideally, more convenient (for humans) symbolic form derived from
> corresponding UAPI enums would be accepted (e.g., `-o
> delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
> it requires a bunch of UAPI header churn, so I postponed it until this
> feature lands upstream or at least there is a definite consensus that
> this feature is acceptable and is going to make it, just to minimize
> amount of wasted effort and not increase amount of non-essential code to
> be reviewed.
> 
> Attentive reader will notice that BPF FS is now marked as
> FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
> user namespace as long as the process has sufficient *namespaced*
> capabilities within that user namespace. But in reality we still
> restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
> init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
> to allow creating BPF FS context object (i.e., fsopen("bpf")) from
> inside unprivileged process inside non-init userns, to capture that
> userns as the owning userns. It will still be required to pass this
> context object back to privileged process to instantiate and mount it.
> 
> This manipulation is important, because capturing non-init userns as the
> owning userns of BPF FS instance (super block) allows to use that userns
> to constraint BPF token to that userns later on (see next patch). So
> creating BPF FS with delegation inside unprivileged userns will restrict
> derived BPF token objects to only "work" inside that intended userns,
> making it scoped to a intended "container".
> 
> There is a set of selftests at the end of the patch set that simulates
> this sequence of steps and validates that everything works as intended.
> But careful review is requested to make sure there are no missed gaps in
> the implementation and testing.
> 
> All this is based on suggestions and discussions with Christian Brauner
> ([0]), to the best of my ability to follow all the implications.

"who will not be held responsible for any CVE future or present as he's
 not sure whether bpf token is a good idea in general"

I'm not opposing it because it's really not my subsystem. But it'd be
nice if you also added a disclaimer that I'm not endorsing this. :)

A comment below.

> 
>   [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  include/linux/bpf.h | 10 ++++++
>  kernel/bpf/inode.c  | 88 +++++++++++++++++++++++++++++++++++++++------
>  2 files changed, 88 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index b4825d3cdb29..df50a7bf1a77 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1562,6 +1562,16 @@ struct bpf_link_primer {
>  	u32 id;
>  };
>  
> +struct bpf_mount_opts {
> +	umode_t mode;
> +
> +	/* BPF token-related delegation options */
> +	u64 delegate_cmds;
> +	u64 delegate_maps;
> +	u64 delegate_progs;
> +	u64 delegate_attachs;
> +};
> +
>  struct bpf_struct_ops_value;
>  struct btf_member;
>  
> diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
> index 1aafb2ff2e95..e49e93bc65e3 100644
> --- a/kernel/bpf/inode.c
> +++ b/kernel/bpf/inode.c
> @@ -20,6 +20,7 @@
>  #include <linux/filter.h>
>  #include <linux/bpf.h>
>  #include <linux/bpf_trace.h>
> +#include <linux/kstrtox.h>
>  #include "preload/bpf_preload.h"
>  
>  enum bpf_type {
> @@ -599,10 +600,31 @@ EXPORT_SYMBOL(bpf_prog_get_type_path);
>   */
>  static int bpf_show_options(struct seq_file *m, struct dentry *root)
>  {
> +	struct bpf_mount_opts *opts = root->d_sb->s_fs_info;
>  	umode_t mode = d_inode(root)->i_mode & S_IALLUGO & ~S_ISVTX;
>  
>  	if (mode != S_IRWXUGO)
>  		seq_printf(m, ",mode=%o", mode);
> +
> +	if (opts->delegate_cmds == ~0ULL)
> +		seq_printf(m, ",delegate_cmds=any");
> +	else if (opts->delegate_cmds)
> +		seq_printf(m, ",delegate_cmds=0x%llx", opts->delegate_cmds);
> +
> +	if (opts->delegate_maps == ~0ULL)
> +		seq_printf(m, ",delegate_maps=any");
> +	else if (opts->delegate_maps)
> +		seq_printf(m, ",delegate_maps=0x%llx", opts->delegate_maps);
> +
> +	if (opts->delegate_progs == ~0ULL)
> +		seq_printf(m, ",delegate_progs=any");
> +	else if (opts->delegate_progs)
> +		seq_printf(m, ",delegate_progs=0x%llx", opts->delegate_progs);
> +
> +	if (opts->delegate_attachs == ~0ULL)
> +		seq_printf(m, ",delegate_attachs=any");
> +	else if (opts->delegate_attachs)
> +		seq_printf(m, ",delegate_attachs=0x%llx", opts->delegate_attachs);
>  	return 0;
>  }
>  
> @@ -626,22 +648,27 @@ static const struct super_operations bpf_super_ops = {
>  
>  enum {
>  	OPT_MODE,
> +	OPT_DELEGATE_CMDS,
> +	OPT_DELEGATE_MAPS,
> +	OPT_DELEGATE_PROGS,
> +	OPT_DELEGATE_ATTACHS,
>  };
>  
>  static const struct fs_parameter_spec bpf_fs_parameters[] = {
>  	fsparam_u32oct	("mode",			OPT_MODE),
> +	fsparam_string	("delegate_cmds",		OPT_DELEGATE_CMDS),
> +	fsparam_string	("delegate_maps",		OPT_DELEGATE_MAPS),
> +	fsparam_string	("delegate_progs",		OPT_DELEGATE_PROGS),
> +	fsparam_string	("delegate_attachs",		OPT_DELEGATE_ATTACHS),
>  	{}
>  };
>  
> -struct bpf_mount_opts {
> -	umode_t mode;
> -};
> -
>  static int bpf_parse_param(struct fs_context *fc, struct fs_parameter *param)
>  {
> -	struct bpf_mount_opts *opts = fc->fs_private;
> +	struct bpf_mount_opts *opts = fc->s_fs_info;
>  	struct fs_parse_result result;
> -	int opt;
> +	int opt, err;
> +	u64 msk;
>  
>  	opt = fs_parse(fc, bpf_fs_parameters, param, &result);
>  	if (opt < 0) {
> @@ -665,6 +692,25 @@ static int bpf_parse_param(struct fs_context *fc, struct fs_parameter *param)
>  	case OPT_MODE:
>  		opts->mode = result.uint_32 & S_IALLUGO;
>  		break;
> +	case OPT_DELEGATE_CMDS:
> +	case OPT_DELEGATE_MAPS:
> +	case OPT_DELEGATE_PROGS:
> +	case OPT_DELEGATE_ATTACHS:
> +		if (strcmp(param->string, "any") == 0) {
> +			msk = ~0ULL;
> +		} else {
> +			err = kstrtou64(param->string, 0, &msk);
> +			if (err)
> +				return err;
> +		}
> +		switch (opt) {
> +		case OPT_DELEGATE_CMDS: opts->delegate_cmds |= msk; break;
> +		case OPT_DELEGATE_MAPS: opts->delegate_maps |= msk; break;
> +		case OPT_DELEGATE_PROGS: opts->delegate_progs |= msk; break;
> +		case OPT_DELEGATE_ATTACHS: opts->delegate_attachs |= msk; break;
> +		default: return -EINVAL;
> +		}
> +		break;
>  	}

So just to repeat that this will allow a container to set it's own
delegation options:

        # unprivileged container

        fd_fs = fsopen();
        fsconfig(fd_fs, FSCONFIG_BLA_BLA, "give-me-all-the-delegation");

        # Now hand of that fd_fs to a privileged process

        fsconfig(fd_fs, FSCONFIG_CREATE_CMD, ...)

This means the container manager can't be part of your threat model
because you need to trust it to set delegation options.

But if the container manager is part of your threat model then you can
never trust an fd_fs handed to you because the container manager might
have enabled arbitrary delegation privileges.

There's ways around this:

(1) kernel: Account for this in the kernel and require privileges when
    setting delegation options.
(2) userspace: A trusted helper that allocates an fs_context fd in
    the target user namespace, then sets delegation options and creates
    superblock.

(1) Is more restrictive but also more secure. (2) is less restrictive
but requires more care from userspace.

Either way I would probably consider writing a document detailing
various delegation scenarios and possible pitfalls and implications
before advertising it.

If you choose (2) then you also need to be aware that the security of
this also hinges on bpffs not allowing to reconfigure parameters once it
has been mounted. Otherwise an unprivileged container can change
delegation options.

I would recommend that you either add a dummy bpf_reconfigure() method
with a comment in it or you add a comment on top of bpf_context_ops.
Something like:

/*
 * Unprivileged mounts of bpffs are owned by the user namespace they are
 * mounted in. That means unprivileged users can change vfs mount
 * options (ro<->rw, nosuid, etc.).
 *
 * They currently cannot change bpffs specific mount options such as
 * delegation settings. If that is ever implemented it is necessary to
 * require rivileges in the initial namespace. Otherwise unprivileged
 * users can change delegation options to whatever they want.
 */
Andrii Nakryiko Nov. 8, 2023, 9:09 p.m. UTC | #2
On Wed, Nov 8, 2023 at 5:51 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Fri, Nov 03, 2023 at 12:05:08PM -0700, Andrii Nakryiko wrote:
> > Add few new mount options to BPF FS that allow to specify that a given
> > BPF FS instance allows creation of BPF token (added in the next patch),
> > and what sort of operations are allowed under BPF token. As such, we get
> > 4 new mount options, each is a bit mask
> >   - `delegate_cmds` allow to specify which bpf() syscall commands are
> >     allowed with BPF token derived from this BPF FS instance;
> >   - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
> >     a set of allowable BPF map types that could be created with BPF token;
> >   - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
> >     a set of allowable BPF program types that could be loaded with BPF token;
> >   - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
> >     a set of allowable BPF program attach types that could be loaded with
> >     BPF token; delegate_progs and delegate_attachs are meant to be used
> >     together, as full BPF program type is, in general, determined
> >     through both program type and program attach type.
> >
> > Currently, these mount options accept the following forms of values:
> >   - a special value "any", that enables all possible values of a given
> >   bit set;
> >   - numeric value (decimal or hexadecimal, determined by kernel
> >   automatically) that specifies a bit mask value directly;
> >   - all the values for a given mount option are combined, if specified
> >   multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
> >   delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
> >   mask.
> >
> > Ideally, more convenient (for humans) symbolic form derived from
> > corresponding UAPI enums would be accepted (e.g., `-o
> > delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
> > it requires a bunch of UAPI header churn, so I postponed it until this
> > feature lands upstream or at least there is a definite consensus that
> > this feature is acceptable and is going to make it, just to minimize
> > amount of wasted effort and not increase amount of non-essential code to
> > be reviewed.
> >
> > Attentive reader will notice that BPF FS is now marked as
> > FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
> > user namespace as long as the process has sufficient *namespaced*
> > capabilities within that user namespace. But in reality we still
> > restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
> > init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
> > to allow creating BPF FS context object (i.e., fsopen("bpf")) from
> > inside unprivileged process inside non-init userns, to capture that
> > userns as the owning userns. It will still be required to pass this
> > context object back to privileged process to instantiate and mount it.
> >
> > This manipulation is important, because capturing non-init userns as the
> > owning userns of BPF FS instance (super block) allows to use that userns
> > to constraint BPF token to that userns later on (see next patch). So
> > creating BPF FS with delegation inside unprivileged userns will restrict
> > derived BPF token objects to only "work" inside that intended userns,
> > making it scoped to a intended "container".
> >
> > There is a set of selftests at the end of the patch set that simulates
> > this sequence of steps and validates that everything works as intended.
> > But careful review is requested to make sure there are no missed gaps in
> > the implementation and testing.
> >
> > All this is based on suggestions and discussions with Christian Brauner
> > ([0]), to the best of my ability to follow all the implications.
>
> "who will not be held responsible for any CVE future or present as he's
>  not sure whether bpf token is a good idea in general"
>
> I'm not opposing it because it's really not my subsystem. But it'd be
> nice if you also added a disclaimer that I'm not endorsing this. :)
>

Sure, I'll clarify. I still appreciate your reviewing everything and
pointing out all the gotchas (like the reconfiguration and other
stuff), thanks!

> A comment below.
>
> >
> >   [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  include/linux/bpf.h | 10 ++++++
> >  kernel/bpf/inode.c  | 88 +++++++++++++++++++++++++++++++++++++++------
> >  2 files changed, 88 insertions(+), 10 deletions(-)
> >

[...]

> >       opt = fs_parse(fc, bpf_fs_parameters, param, &result);
> >       if (opt < 0) {
> > @@ -665,6 +692,25 @@ static int bpf_parse_param(struct fs_context *fc, struct fs_parameter *param)
> >       case OPT_MODE:
> >               opts->mode = result.uint_32 & S_IALLUGO;
> >               break;
> > +     case OPT_DELEGATE_CMDS:
> > +     case OPT_DELEGATE_MAPS:
> > +     case OPT_DELEGATE_PROGS:
> > +     case OPT_DELEGATE_ATTACHS:
> > +             if (strcmp(param->string, "any") == 0) {
> > +                     msk = ~0ULL;
> > +             } else {
> > +                     err = kstrtou64(param->string, 0, &msk);
> > +                     if (err)
> > +                             return err;
> > +             }
> > +             switch (opt) {
> > +             case OPT_DELEGATE_CMDS: opts->delegate_cmds |= msk; break;
> > +             case OPT_DELEGATE_MAPS: opts->delegate_maps |= msk; break;
> > +             case OPT_DELEGATE_PROGS: opts->delegate_progs |= msk; break;
> > +             case OPT_DELEGATE_ATTACHS: opts->delegate_attachs |= msk; break;
> > +             default: return -EINVAL;
> > +             }
> > +             break;
> >       }
>
> So just to repeat that this will allow a container to set it's own
> delegation options:
>
>         # unprivileged container
>
>         fd_fs = fsopen();
>         fsconfig(fd_fs, FSCONFIG_BLA_BLA, "give-me-all-the-delegation");
>
>         # Now hand of that fd_fs to a privileged process
>
>         fsconfig(fd_fs, FSCONFIG_CREATE_CMD, ...)
>
> This means the container manager can't be part of your threat model
> because you need to trust it to set delegation options.
>
> But if the container manager is part of your threat model then you can
> never trust an fd_fs handed to you because the container manager might
> have enabled arbitrary delegation privileges.
>
> There's ways around this:
>
> (1) kernel: Account for this in the kernel and require privileges when
>     setting delegation options.

What sort of privilege would that be? We are in an unprivileged user
namespace, so that would have to be some ns_capable() checks or
something? I can add ns_capable(CAP_BPF), but what else did you have
in mind?

I think even if we say that privileged parent does FSCONFIG_SET_STRING
and unprivileged child just does sys_fsopen("bpf", 0) and nothing
more, we still can't be sure that child won't race with parent and set
FSCONFIG_SET_STRING at the same time. Because they both have access to
the same fs_fd.

> (2) userspace: A trusted helper that allocates an fs_context fd in
>     the target user namespace, then sets delegation options and creates
>     superblock.
>
> (1) Is more restrictive but also more secure. (2) is less restrictive
> but requires more care from userspace.
>
> Either way I would probably consider writing a document detailing
> various delegation scenarios and possible pitfalls and implications
> before advertising it.
>
> If you choose (2) then you also need to be aware that the security of
> this also hinges on bpffs not allowing to reconfigure parameters once it
> has been mounted. Otherwise an unprivileged container can change
> delegation options.
>
> I would recommend that you either add a dummy bpf_reconfigure() method
> with a comment in it or you add a comment on top of bpf_context_ops.
> Something like:
>
> /*
>  * Unprivileged mounts of bpffs are owned by the user namespace they are
>  * mounted in. That means unprivileged users can change vfs mount
>  * options (ro<->rw, nosuid, etc.).
>  *
>  * They currently cannot change bpffs specific mount options such as
>  * delegation settings. If that is ever implemented it is necessary to
>  * require rivileges in the initial namespace. Otherwise unprivileged
>  * users can change delegation options to whatever they want.
>  */

Yep, I will add a custom callback. I think we can allow reconfiguring
towards less permissive delegation subset, but I'll need to look at
the specifics to see if we can support that easily.
Christian Brauner Nov. 9, 2023, 8:47 a.m. UTC | #3
On Wed, Nov 08, 2023 at 01:09:27PM -0800, Andrii Nakryiko wrote:
> On Wed, Nov 8, 2023 at 5:51 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Fri, Nov 03, 2023 at 12:05:08PM -0700, Andrii Nakryiko wrote:
> > > Add few new mount options to BPF FS that allow to specify that a given
> > > BPF FS instance allows creation of BPF token (added in the next patch),
> > > and what sort of operations are allowed under BPF token. As such, we get
> > > 4 new mount options, each is a bit mask
> > >   - `delegate_cmds` allow to specify which bpf() syscall commands are
> > >     allowed with BPF token derived from this BPF FS instance;
> > >   - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
> > >     a set of allowable BPF map types that could be created with BPF token;
> > >   - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
> > >     a set of allowable BPF program types that could be loaded with BPF token;
> > >   - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
> > >     a set of allowable BPF program attach types that could be loaded with
> > >     BPF token; delegate_progs and delegate_attachs are meant to be used
> > >     together, as full BPF program type is, in general, determined
> > >     through both program type and program attach type.
> > >
> > > Currently, these mount options accept the following forms of values:
> > >   - a special value "any", that enables all possible values of a given
> > >   bit set;
> > >   - numeric value (decimal or hexadecimal, determined by kernel
> > >   automatically) that specifies a bit mask value directly;
> > >   - all the values for a given mount option are combined, if specified
> > >   multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
> > >   delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
> > >   mask.
> > >
> > > Ideally, more convenient (for humans) symbolic form derived from
> > > corresponding UAPI enums would be accepted (e.g., `-o
> > > delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
> > > it requires a bunch of UAPI header churn, so I postponed it until this
> > > feature lands upstream or at least there is a definite consensus that
> > > this feature is acceptable and is going to make it, just to minimize
> > > amount of wasted effort and not increase amount of non-essential code to
> > > be reviewed.
> > >
> > > Attentive reader will notice that BPF FS is now marked as
> > > FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
> > > user namespace as long as the process has sufficient *namespaced*
> > > capabilities within that user namespace. But in reality we still
> > > restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
> > > init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
> > > to allow creating BPF FS context object (i.e., fsopen("bpf")) from
> > > inside unprivileged process inside non-init userns, to capture that
> > > userns as the owning userns. It will still be required to pass this
> > > context object back to privileged process to instantiate and mount it.
> > >
> > > This manipulation is important, because capturing non-init userns as the
> > > owning userns of BPF FS instance (super block) allows to use that userns
> > > to constraint BPF token to that userns later on (see next patch). So
> > > creating BPF FS with delegation inside unprivileged userns will restrict
> > > derived BPF token objects to only "work" inside that intended userns,
> > > making it scoped to a intended "container".
> > >
> > > There is a set of selftests at the end of the patch set that simulates
> > > this sequence of steps and validates that everything works as intended.
> > > But careful review is requested to make sure there are no missed gaps in
> > > the implementation and testing.
> > >
> > > All this is based on suggestions and discussions with Christian Brauner
> > > ([0]), to the best of my ability to follow all the implications.
> >
> > "who will not be held responsible for any CVE future or present as he's
> >  not sure whether bpf token is a good idea in general"
> >
> > I'm not opposing it because it's really not my subsystem. But it'd be
> > nice if you also added a disclaimer that I'm not endorsing this. :)
> >
> 
> Sure, I'll clarify. I still appreciate your reviewing everything and
> pointing out all the gotchas (like the reconfiguration and other
> stuff), thanks!
> 
> > A comment below.
> >
> > >
> > >   [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > >  include/linux/bpf.h | 10 ++++++
> > >  kernel/bpf/inode.c  | 88 +++++++++++++++++++++++++++++++++++++++------
> > >  2 files changed, 88 insertions(+), 10 deletions(-)
> > >
> 
> [...]
> 
> > >       opt = fs_parse(fc, bpf_fs_parameters, param, &result);
> > >       if (opt < 0) {
> > > @@ -665,6 +692,25 @@ static int bpf_parse_param(struct fs_context *fc, struct fs_parameter *param)
> > >       case OPT_MODE:
> > >               opts->mode = result.uint_32 & S_IALLUGO;
> > >               break;
> > > +     case OPT_DELEGATE_CMDS:
> > > +     case OPT_DELEGATE_MAPS:
> > > +     case OPT_DELEGATE_PROGS:
> > > +     case OPT_DELEGATE_ATTACHS:
> > > +             if (strcmp(param->string, "any") == 0) {
> > > +                     msk = ~0ULL;
> > > +             } else {
> > > +                     err = kstrtou64(param->string, 0, &msk);
> > > +                     if (err)
> > > +                             return err;
> > > +             }
> > > +             switch (opt) {
> > > +             case OPT_DELEGATE_CMDS: opts->delegate_cmds |= msk; break;
> > > +             case OPT_DELEGATE_MAPS: opts->delegate_maps |= msk; break;
> > > +             case OPT_DELEGATE_PROGS: opts->delegate_progs |= msk; break;
> > > +             case OPT_DELEGATE_ATTACHS: opts->delegate_attachs |= msk; break;
> > > +             default: return -EINVAL;
> > > +             }
> > > +             break;
> > >       }
> >
> > So just to repeat that this will allow a container to set it's own
> > delegation options:
> >
> >         # unprivileged container
> >
> >         fd_fs = fsopen();
> >         fsconfig(fd_fs, FSCONFIG_BLA_BLA, "give-me-all-the-delegation");
> >
> >         # Now hand of that fd_fs to a privileged process
> >
> >         fsconfig(fd_fs, FSCONFIG_CREATE_CMD, ...)
> >
> > This means the container manager can't be part of your threat model
> > because you need to trust it to set delegation options.
> >
> > But if the container manager is part of your threat model then you can
> > never trust an fd_fs handed to you because the container manager might
> > have enabled arbitrary delegation privileges.
> >
> > There's ways around this:
> >
> > (1) kernel: Account for this in the kernel and require privileges when
> >     setting delegation options.
> 
> What sort of privilege would that be? We are in an unprivileged user
> namespace, so that would have to be some ns_capable() checks or
> something? I can add ns_capable(CAP_BPF), but what else did you have
> in mind?

You would require privileges in the initial namespace aka capable()
checks similar to what you require for superblock creation.

> 
> I think even if we say that privileged parent does FSCONFIG_SET_STRING
> and unprivileged child just does sys_fsopen("bpf", 0) and nothing
> more, we still can't be sure that child won't race with parent and set
> FSCONFIG_SET_STRING at the same time. Because they both have access to
> the same fs_fd.

Unless you require privileges as outlined above to set delegation
options in which case an unprivileged container cannot change delegation
options at all.

> 
> > (2) userspace: A trusted helper that allocates an fs_context fd in
> >     the target user namespace, then sets delegation options and creates
> >     superblock.
> >
> > (1) Is more restrictive but also more secure. (2) is less restrictive
> > but requires more care from userspace.
> >
> > Either way I would probably consider writing a document detailing
> > various delegation scenarios and possible pitfalls and implications
> > before advertising it.
> >
> > If you choose (2) then you also need to be aware that the security of
> > this also hinges on bpffs not allowing to reconfigure parameters once it
> > has been mounted. Otherwise an unprivileged container can change
> > delegation options.
> >
> > I would recommend that you either add a dummy bpf_reconfigure() method
> > with a comment in it or you add a comment on top of bpf_context_ops.
> > Something like:
> >
> > /*
> >  * Unprivileged mounts of bpffs are owned by the user namespace they are
> >  * mounted in. That means unprivileged users can change vfs mount
> >  * options (ro<->rw, nosuid, etc.).
> >  *
> >  * They currently cannot change bpffs specific mount options such as
> >  * delegation settings. If that is ever implemented it is necessary to
> >  * require rivileges in the initial namespace. Otherwise unprivileged
> >  * users can change delegation options to whatever they want.
> >  */
> 
> Yep, I will add a custom callback. I think we can allow reconfiguring
> towards less permissive delegation subset, but I'll need to look at
> the specifics to see if we can support that easily.
Andrii Nakryiko Nov. 9, 2023, 5:09 p.m. UTC | #4
On Thu, Nov 9, 2023 at 12:48 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Wed, Nov 08, 2023 at 01:09:27PM -0800, Andrii Nakryiko wrote:
> > On Wed, Nov 8, 2023 at 5:51 AM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > On Fri, Nov 03, 2023 at 12:05:08PM -0700, Andrii Nakryiko wrote:
> > > > Add few new mount options to BPF FS that allow to specify that a given
> > > > BPF FS instance allows creation of BPF token (added in the next patch),
> > > > and what sort of operations are allowed under BPF token. As such, we get
> > > > 4 new mount options, each is a bit mask
> > > >   - `delegate_cmds` allow to specify which bpf() syscall commands are
> > > >     allowed with BPF token derived from this BPF FS instance;
> > > >   - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
> > > >     a set of allowable BPF map types that could be created with BPF token;
> > > >   - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
> > > >     a set of allowable BPF program types that could be loaded with BPF token;
> > > >   - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
> > > >     a set of allowable BPF program attach types that could be loaded with
> > > >     BPF token; delegate_progs and delegate_attachs are meant to be used
> > > >     together, as full BPF program type is, in general, determined
> > > >     through both program type and program attach type.
> > > >
> > > > Currently, these mount options accept the following forms of values:
> > > >   - a special value "any", that enables all possible values of a given
> > > >   bit set;
> > > >   - numeric value (decimal or hexadecimal, determined by kernel
> > > >   automatically) that specifies a bit mask value directly;
> > > >   - all the values for a given mount option are combined, if specified
> > > >   multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
> > > >   delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
> > > >   mask.
> > > >
> > > > Ideally, more convenient (for humans) symbolic form derived from
> > > > corresponding UAPI enums would be accepted (e.g., `-o
> > > > delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
> > > > it requires a bunch of UAPI header churn, so I postponed it until this
> > > > feature lands upstream or at least there is a definite consensus that
> > > > this feature is acceptable and is going to make it, just to minimize
> > > > amount of wasted effort and not increase amount of non-essential code to
> > > > be reviewed.
> > > >
> > > > Attentive reader will notice that BPF FS is now marked as
> > > > FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
> > > > user namespace as long as the process has sufficient *namespaced*
> > > > capabilities within that user namespace. But in reality we still
> > > > restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
> > > > init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
> > > > to allow creating BPF FS context object (i.e., fsopen("bpf")) from
> > > > inside unprivileged process inside non-init userns, to capture that
> > > > userns as the owning userns. It will still be required to pass this
> > > > context object back to privileged process to instantiate and mount it.
> > > >
> > > > This manipulation is important, because capturing non-init userns as the
> > > > owning userns of BPF FS instance (super block) allows to use that userns
> > > > to constraint BPF token to that userns later on (see next patch). So
> > > > creating BPF FS with delegation inside unprivileged userns will restrict
> > > > derived BPF token objects to only "work" inside that intended userns,
> > > > making it scoped to a intended "container".
> > > >
> > > > There is a set of selftests at the end of the patch set that simulates
> > > > this sequence of steps and validates that everything works as intended.
> > > > But careful review is requested to make sure there are no missed gaps in
> > > > the implementation and testing.
> > > >
> > > > All this is based on suggestions and discussions with Christian Brauner
> > > > ([0]), to the best of my ability to follow all the implications.
> > >
> > > "who will not be held responsible for any CVE future or present as he's
> > >  not sure whether bpf token is a good idea in general"
> > >
> > > I'm not opposing it because it's really not my subsystem. But it'd be
> > > nice if you also added a disclaimer that I'm not endorsing this. :)
> > >
> >
> > Sure, I'll clarify. I still appreciate your reviewing everything and
> > pointing out all the gotchas (like the reconfiguration and other
> > stuff), thanks!
> >
> > > A comment below.
> > >
> > > >
> > > >   [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
> > > >
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > ---
> > > >  include/linux/bpf.h | 10 ++++++
> > > >  kernel/bpf/inode.c  | 88 +++++++++++++++++++++++++++++++++++++++------
> > > >  2 files changed, 88 insertions(+), 10 deletions(-)
> > > >
> >
> > [...]
> >
> > > >       opt = fs_parse(fc, bpf_fs_parameters, param, &result);
> > > >       if (opt < 0) {
> > > > @@ -665,6 +692,25 @@ static int bpf_parse_param(struct fs_context *fc, struct fs_parameter *param)
> > > >       case OPT_MODE:
> > > >               opts->mode = result.uint_32 & S_IALLUGO;
> > > >               break;
> > > > +     case OPT_DELEGATE_CMDS:
> > > > +     case OPT_DELEGATE_MAPS:
> > > > +     case OPT_DELEGATE_PROGS:
> > > > +     case OPT_DELEGATE_ATTACHS:
> > > > +             if (strcmp(param->string, "any") == 0) {
> > > > +                     msk = ~0ULL;
> > > > +             } else {
> > > > +                     err = kstrtou64(param->string, 0, &msk);
> > > > +                     if (err)
> > > > +                             return err;
> > > > +             }
> > > > +             switch (opt) {
> > > > +             case OPT_DELEGATE_CMDS: opts->delegate_cmds |= msk; break;
> > > > +             case OPT_DELEGATE_MAPS: opts->delegate_maps |= msk; break;
> > > > +             case OPT_DELEGATE_PROGS: opts->delegate_progs |= msk; break;
> > > > +             case OPT_DELEGATE_ATTACHS: opts->delegate_attachs |= msk; break;
> > > > +             default: return -EINVAL;
> > > > +             }
> > > > +             break;
> > > >       }
> > >
> > > So just to repeat that this will allow a container to set it's own
> > > delegation options:
> > >
> > >         # unprivileged container
> > >
> > >         fd_fs = fsopen();
> > >         fsconfig(fd_fs, FSCONFIG_BLA_BLA, "give-me-all-the-delegation");
> > >
> > >         # Now hand of that fd_fs to a privileged process
> > >
> > >         fsconfig(fd_fs, FSCONFIG_CREATE_CMD, ...)
> > >
> > > This means the container manager can't be part of your threat model
> > > because you need to trust it to set delegation options.
> > >
> > > But if the container manager is part of your threat model then you can
> > > never trust an fd_fs handed to you because the container manager might
> > > have enabled arbitrary delegation privileges.
> > >
> > > There's ways around this:
> > >
> > > (1) kernel: Account for this in the kernel and require privileges when
> > >     setting delegation options.
> >
> > What sort of privilege would that be? We are in an unprivileged user
> > namespace, so that would have to be some ns_capable() checks or
> > something? I can add ns_capable(CAP_BPF), but what else did you have
> > in mind?
>
> You would require privileges in the initial namespace aka capable()
> checks similar to what you require for superblock creation.

ok, I was just wondering if I'm missing something non-obvious.
capable(CAP_SYS_ADMIN) makes sense and doesn't really hurt intended
use case. Privileged parent will set these config values and then do
FSCONFIG_CREATE_CMD.

For reconfiguration I'll enforce same capable(CAP_SYS_ADMIN) checks,
unless unprivileged user drops permissions to more restrictive ones
(but I haven't had a chance to look at exact callback API, so we'll
see if that's easy to support).

Thanks for feedback!

>
> >
> > I think even if we say that privileged parent does FSCONFIG_SET_STRING
> > and unprivileged child just does sys_fsopen("bpf", 0) and nothing
> > more, we still can't be sure that child won't race with parent and set
> > FSCONFIG_SET_STRING at the same time. Because they both have access to
> > the same fs_fd.
>
> Unless you require privileges as outlined above to set delegation
> options in which case an unprivileged container cannot change delegation
> options at all.

Yep, makes sense, that's what I'm going to do.

>
> >
> > > (2) userspace: A trusted helper that allocates an fs_context fd in
> > >     the target user namespace, then sets delegation options and creates
> > >     superblock.
> > >
> > > (1) Is more restrictive but also more secure. (2) is less restrictive
> > > but requires more care from userspace.
> > >
> > > Either way I would probably consider writing a document detailing
> > > various delegation scenarios and possible pitfalls and implications
> > > before advertising it.
> > >
> > > If you choose (2) then you also need to be aware that the security of
> > > this also hinges on bpffs not allowing to reconfigure parameters once it
> > > has been mounted. Otherwise an unprivileged container can change
> > > delegation options.
> > >
> > > I would recommend that you either add a dummy bpf_reconfigure() method
> > > with a comment in it or you add a comment on top of bpf_context_ops.
> > > Something like:
> > >
> > > /*
> > >  * Unprivileged mounts of bpffs are owned by the user namespace they are
> > >  * mounted in. That means unprivileged users can change vfs mount
> > >  * options (ro<->rw, nosuid, etc.).
> > >  *
> > >  * They currently cannot change bpffs specific mount options such as
> > >  * delegation settings. If that is ever implemented it is necessary to
> > >  * require rivileges in the initial namespace. Otherwise unprivileged
> > >  * users can change delegation options to whatever they want.
> > >  */
> >
> > Yep, I will add a custom callback. I think we can allow reconfiguring
> > towards less permissive delegation subset, but I'll need to look at
> > the specifics to see if we can support that easily.
Andrii Nakryiko Nov. 9, 2023, 10:29 p.m. UTC | #5
On Thu, Nov 9, 2023 at 9:09 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Nov 9, 2023 at 12:48 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Wed, Nov 08, 2023 at 01:09:27PM -0800, Andrii Nakryiko wrote:
> > > On Wed, Nov 8, 2023 at 5:51 AM Christian Brauner <brauner@kernel.org> wrote:
> > > >
> > > > On Fri, Nov 03, 2023 at 12:05:08PM -0700, Andrii Nakryiko wrote:
> > > > > Add few new mount options to BPF FS that allow to specify that a given
> > > > > BPF FS instance allows creation of BPF token (added in the next patch),
> > > > > and what sort of operations are allowed under BPF token. As such, we get
> > > > > 4 new mount options, each is a bit mask
> > > > >   - `delegate_cmds` allow to specify which bpf() syscall commands are
> > > > >     allowed with BPF token derived from this BPF FS instance;
> > > > >   - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
> > > > >     a set of allowable BPF map types that could be created with BPF token;
> > > > >   - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
> > > > >     a set of allowable BPF program types that could be loaded with BPF token;
> > > > >   - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
> > > > >     a set of allowable BPF program attach types that could be loaded with
> > > > >     BPF token; delegate_progs and delegate_attachs are meant to be used
> > > > >     together, as full BPF program type is, in general, determined
> > > > >     through both program type and program attach type.
> > > > >
> > > > > Currently, these mount options accept the following forms of values:
> > > > >   - a special value "any", that enables all possible values of a given
> > > > >   bit set;
> > > > >   - numeric value (decimal or hexadecimal, determined by kernel
> > > > >   automatically) that specifies a bit mask value directly;
> > > > >   - all the values for a given mount option are combined, if specified
> > > > >   multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
> > > > >   delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
> > > > >   mask.
> > > > >
> > > > > Ideally, more convenient (for humans) symbolic form derived from
> > > > > corresponding UAPI enums would be accepted (e.g., `-o
> > > > > delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
> > > > > it requires a bunch of UAPI header churn, so I postponed it until this
> > > > > feature lands upstream or at least there is a definite consensus that
> > > > > this feature is acceptable and is going to make it, just to minimize
> > > > > amount of wasted effort and not increase amount of non-essential code to
> > > > > be reviewed.
> > > > >
> > > > > Attentive reader will notice that BPF FS is now marked as
> > > > > FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
> > > > > user namespace as long as the process has sufficient *namespaced*
> > > > > capabilities within that user namespace. But in reality we still
> > > > > restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
> > > > > init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
> > > > > to allow creating BPF FS context object (i.e., fsopen("bpf")) from
> > > > > inside unprivileged process inside non-init userns, to capture that
> > > > > userns as the owning userns. It will still be required to pass this
> > > > > context object back to privileged process to instantiate and mount it.
> > > > >
> > > > > This manipulation is important, because capturing non-init userns as the
> > > > > owning userns of BPF FS instance (super block) allows to use that userns
> > > > > to constraint BPF token to that userns later on (see next patch). So
> > > > > creating BPF FS with delegation inside unprivileged userns will restrict
> > > > > derived BPF token objects to only "work" inside that intended userns,
> > > > > making it scoped to a intended "container".
> > > > >
> > > > > There is a set of selftests at the end of the patch set that simulates
> > > > > this sequence of steps and validates that everything works as intended.
> > > > > But careful review is requested to make sure there are no missed gaps in
> > > > > the implementation and testing.
> > > > >
> > > > > All this is based on suggestions and discussions with Christian Brauner
> > > > > ([0]), to the best of my ability to follow all the implications.
> > > >
> > > > "who will not be held responsible for any CVE future or present as he's
> > > >  not sure whether bpf token is a good idea in general"
> > > >
> > > > I'm not opposing it because it's really not my subsystem. But it'd be
> > > > nice if you also added a disclaimer that I'm not endorsing this. :)
> > > >
> > >
> > > Sure, I'll clarify. I still appreciate your reviewing everything and
> > > pointing out all the gotchas (like the reconfiguration and other
> > > stuff), thanks!
> > >
> > > > A comment below.
> > > >
> > > > >
> > > > >   [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
> > > > >
> > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > > ---
> > > > >  include/linux/bpf.h | 10 ++++++
> > > > >  kernel/bpf/inode.c  | 88 +++++++++++++++++++++++++++++++++++++++------
> > > > >  2 files changed, 88 insertions(+), 10 deletions(-)
> > > > >
> > >
> > > [...]
> > >
> > > > >       opt = fs_parse(fc, bpf_fs_parameters, param, &result);
> > > > >       if (opt < 0) {
> > > > > @@ -665,6 +692,25 @@ static int bpf_parse_param(struct fs_context *fc, struct fs_parameter *param)
> > > > >       case OPT_MODE:
> > > > >               opts->mode = result.uint_32 & S_IALLUGO;
> > > > >               break;
> > > > > +     case OPT_DELEGATE_CMDS:
> > > > > +     case OPT_DELEGATE_MAPS:
> > > > > +     case OPT_DELEGATE_PROGS:
> > > > > +     case OPT_DELEGATE_ATTACHS:
> > > > > +             if (strcmp(param->string, "any") == 0) {
> > > > > +                     msk = ~0ULL;
> > > > > +             } else {
> > > > > +                     err = kstrtou64(param->string, 0, &msk);
> > > > > +                     if (err)
> > > > > +                             return err;
> > > > > +             }
> > > > > +             switch (opt) {
> > > > > +             case OPT_DELEGATE_CMDS: opts->delegate_cmds |= msk; break;
> > > > > +             case OPT_DELEGATE_MAPS: opts->delegate_maps |= msk; break;
> > > > > +             case OPT_DELEGATE_PROGS: opts->delegate_progs |= msk; break;
> > > > > +             case OPT_DELEGATE_ATTACHS: opts->delegate_attachs |= msk; break;
> > > > > +             default: return -EINVAL;
> > > > > +             }
> > > > > +             break;
> > > > >       }
> > > >
> > > > So just to repeat that this will allow a container to set it's own
> > > > delegation options:
> > > >
> > > >         # unprivileged container
> > > >
> > > >         fd_fs = fsopen();
> > > >         fsconfig(fd_fs, FSCONFIG_BLA_BLA, "give-me-all-the-delegation");
> > > >
> > > >         # Now hand of that fd_fs to a privileged process
> > > >
> > > >         fsconfig(fd_fs, FSCONFIG_CREATE_CMD, ...)
> > > >
> > > > This means the container manager can't be part of your threat model
> > > > because you need to trust it to set delegation options.
> > > >
> > > > But if the container manager is part of your threat model then you can
> > > > never trust an fd_fs handed to you because the container manager might
> > > > have enabled arbitrary delegation privileges.
> > > >
> > > > There's ways around this:
> > > >
> > > > (1) kernel: Account for this in the kernel and require privileges when
> > > >     setting delegation options.
> > >
> > > What sort of privilege would that be? We are in an unprivileged user
> > > namespace, so that would have to be some ns_capable() checks or
> > > something? I can add ns_capable(CAP_BPF), but what else did you have
> > > in mind?
> >
> > You would require privileges in the initial namespace aka capable()
> > checks similar to what you require for superblock creation.
>
> ok, I was just wondering if I'm missing something non-obvious.
> capable(CAP_SYS_ADMIN) makes sense and doesn't really hurt intended
> use case. Privileged parent will set these config values and then do
> FSCONFIG_CREATE_CMD.
>
> For reconfiguration I'll enforce same capable(CAP_SYS_ADMIN) checks,
> unless unprivileged user drops permissions to more restrictive ones
> (but I haven't had a chance to look at exact callback API, so we'll
> see if that's easy to support).

Ok, so I played with this a bit. It seems that if I require
capable(CAP_SYS_ADMIN) for in fsconfig() to set delegation options, I
don't have to do anything special about reconfiguration. Any
FSCONFIG_SET_xxx command for delegation option will just fail, and so
reconfiguration is harmless. I'm going to go with that and keep it
simple.

>
> Thanks for feedback!
>
> >
> > >
> > > I think even if we say that privileged parent does FSCONFIG_SET_STRING
> > > and unprivileged child just does sys_fsopen("bpf", 0) and nothing
> > > more, we still can't be sure that child won't race with parent and set
> > > FSCONFIG_SET_STRING at the same time. Because they both have access to
> > > the same fs_fd.
> >
> > Unless you require privileges as outlined above to set delegation
> > options in which case an unprivileged container cannot change delegation
> > options at all.
>
> Yep, makes sense, that's what I'm going to do.
>
> >
> > >
> > > > (2) userspace: A trusted helper that allocates an fs_context fd in
> > > >     the target user namespace, then sets delegation options and creates
> > > >     superblock.
> > > >
> > > > (1) Is more restrictive but also more secure. (2) is less restrictive
> > > > but requires more care from userspace.
> > > >
> > > > Either way I would probably consider writing a document detailing
> > > > various delegation scenarios and possible pitfalls and implications
> > > > before advertising it.
> > > >
> > > > If you choose (2) then you also need to be aware that the security of
> > > > this also hinges on bpffs not allowing to reconfigure parameters once it
> > > > has been mounted. Otherwise an unprivileged container can change
> > > > delegation options.
> > > >
> > > > I would recommend that you either add a dummy bpf_reconfigure() method
> > > > with a comment in it or you add a comment on top of bpf_context_ops.
> > > > Something like:
> > > >
> > > > /*
> > > >  * Unprivileged mounts of bpffs are owned by the user namespace they are
> > > >  * mounted in. That means unprivileged users can change vfs mount
> > > >  * options (ro<->rw, nosuid, etc.).
> > > >  *
> > > >  * They currently cannot change bpffs specific mount options such as
> > > >  * delegation settings. If that is ever implemented it is necessary to
> > > >  * require rivileges in the initial namespace. Otherwise unprivileged
> > > >  * users can change delegation options to whatever they want.
> > > >  */
> > >
> > > Yep, I will add a custom callback. I think we can allow reconfiguring
> > > towards less permissive delegation subset, but I'll need to look at
> > > the specifics to see if we can support that easily.
diff mbox series

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b4825d3cdb29..df50a7bf1a77 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1562,6 +1562,16 @@  struct bpf_link_primer {
 	u32 id;
 };
 
+struct bpf_mount_opts {
+	umode_t mode;
+
+	/* BPF token-related delegation options */
+	u64 delegate_cmds;
+	u64 delegate_maps;
+	u64 delegate_progs;
+	u64 delegate_attachs;
+};
+
 struct bpf_struct_ops_value;
 struct btf_member;
 
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 1aafb2ff2e95..e49e93bc65e3 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -20,6 +20,7 @@ 
 #include <linux/filter.h>
 #include <linux/bpf.h>
 #include <linux/bpf_trace.h>
+#include <linux/kstrtox.h>
 #include "preload/bpf_preload.h"
 
 enum bpf_type {
@@ -599,10 +600,31 @@  EXPORT_SYMBOL(bpf_prog_get_type_path);
  */
 static int bpf_show_options(struct seq_file *m, struct dentry *root)
 {
+	struct bpf_mount_opts *opts = root->d_sb->s_fs_info;
 	umode_t mode = d_inode(root)->i_mode & S_IALLUGO & ~S_ISVTX;
 
 	if (mode != S_IRWXUGO)
 		seq_printf(m, ",mode=%o", mode);
+
+	if (opts->delegate_cmds == ~0ULL)
+		seq_printf(m, ",delegate_cmds=any");
+	else if (opts->delegate_cmds)
+		seq_printf(m, ",delegate_cmds=0x%llx", opts->delegate_cmds);
+
+	if (opts->delegate_maps == ~0ULL)
+		seq_printf(m, ",delegate_maps=any");
+	else if (opts->delegate_maps)
+		seq_printf(m, ",delegate_maps=0x%llx", opts->delegate_maps);
+
+	if (opts->delegate_progs == ~0ULL)
+		seq_printf(m, ",delegate_progs=any");
+	else if (opts->delegate_progs)
+		seq_printf(m, ",delegate_progs=0x%llx", opts->delegate_progs);
+
+	if (opts->delegate_attachs == ~0ULL)
+		seq_printf(m, ",delegate_attachs=any");
+	else if (opts->delegate_attachs)
+		seq_printf(m, ",delegate_attachs=0x%llx", opts->delegate_attachs);
 	return 0;
 }
 
@@ -626,22 +648,27 @@  static const struct super_operations bpf_super_ops = {
 
 enum {
 	OPT_MODE,
+	OPT_DELEGATE_CMDS,
+	OPT_DELEGATE_MAPS,
+	OPT_DELEGATE_PROGS,
+	OPT_DELEGATE_ATTACHS,
 };
 
 static const struct fs_parameter_spec bpf_fs_parameters[] = {
 	fsparam_u32oct	("mode",			OPT_MODE),
+	fsparam_string	("delegate_cmds",		OPT_DELEGATE_CMDS),
+	fsparam_string	("delegate_maps",		OPT_DELEGATE_MAPS),
+	fsparam_string	("delegate_progs",		OPT_DELEGATE_PROGS),
+	fsparam_string	("delegate_attachs",		OPT_DELEGATE_ATTACHS),
 	{}
 };
 
-struct bpf_mount_opts {
-	umode_t mode;
-};
-
 static int bpf_parse_param(struct fs_context *fc, struct fs_parameter *param)
 {
-	struct bpf_mount_opts *opts = fc->fs_private;
+	struct bpf_mount_opts *opts = fc->s_fs_info;
 	struct fs_parse_result result;
-	int opt;
+	int opt, err;
+	u64 msk;
 
 	opt = fs_parse(fc, bpf_fs_parameters, param, &result);
 	if (opt < 0) {
@@ -665,6 +692,25 @@  static int bpf_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	case OPT_MODE:
 		opts->mode = result.uint_32 & S_IALLUGO;
 		break;
+	case OPT_DELEGATE_CMDS:
+	case OPT_DELEGATE_MAPS:
+	case OPT_DELEGATE_PROGS:
+	case OPT_DELEGATE_ATTACHS:
+		if (strcmp(param->string, "any") == 0) {
+			msk = ~0ULL;
+		} else {
+			err = kstrtou64(param->string, 0, &msk);
+			if (err)
+				return err;
+		}
+		switch (opt) {
+		case OPT_DELEGATE_CMDS: opts->delegate_cmds |= msk; break;
+		case OPT_DELEGATE_MAPS: opts->delegate_maps |= msk; break;
+		case OPT_DELEGATE_PROGS: opts->delegate_progs |= msk; break;
+		case OPT_DELEGATE_ATTACHS: opts->delegate_attachs |= msk; break;
+		default: return -EINVAL;
+		}
+		break;
 	}
 
 	return 0;
@@ -739,10 +785,14 @@  static int populate_bpffs(struct dentry *parent)
 static int bpf_fill_super(struct super_block *sb, struct fs_context *fc)
 {
 	static const struct tree_descr bpf_rfiles[] = { { "" } };
-	struct bpf_mount_opts *opts = fc->fs_private;
+	struct bpf_mount_opts *opts = sb->s_fs_info;
 	struct inode *inode;
 	int ret;
 
+	/* Mounting an instance of BPF FS requires privileges */
+	if (fc->user_ns != &init_user_ns && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	ret = simple_fill_super(sb, BPF_FS_MAGIC, bpf_rfiles);
 	if (ret)
 		return ret;
@@ -764,7 +814,10 @@  static int bpf_get_tree(struct fs_context *fc)
 
 static void bpf_free_fc(struct fs_context *fc)
 {
-	kfree(fc->fs_private);
+	struct bpf_mount_opts *opts = fc->s_fs_info;
+
+	if (opts)
+		kfree(opts);
 }
 
 static const struct fs_context_operations bpf_context_ops = {
@@ -786,17 +839,32 @@  static int bpf_init_fs_context(struct fs_context *fc)
 
 	opts->mode = S_IRWXUGO;
 
-	fc->fs_private = opts;
+	/* start out with no BPF token delegation enabled */
+	opts->delegate_cmds = 0;
+	opts->delegate_maps = 0;
+	opts->delegate_progs = 0;
+	opts->delegate_attachs = 0;
+
+	fc->s_fs_info = opts;
 	fc->ops = &bpf_context_ops;
 	return 0;
 }
 
+static void bpf_kill_super(struct super_block *sb)
+{
+	struct bpf_mount_opts *opts = sb->s_fs_info;
+
+	kill_litter_super(sb);
+	kfree(opts);
+}
+
 static struct file_system_type bpf_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "bpf",
 	.init_fs_context = bpf_init_fs_context,
 	.parameters	= bpf_fs_parameters,
-	.kill_sb	= kill_litter_super,
+	.kill_sb	= bpf_kill_super,
+	.fs_flags	= FS_USERNS_MOUNT,
 };
 
 static int __init bpf_init(void)