diff mbox series

[bpf-next,03/29] bpf: introduce BPF token object

Message ID 20240103222034.2582628-4-andrii@kernel.org (mailing list archive)
State Changes Requested
Delegated to: Paul Moore
Headers show
Series BPF token | expand

Commit Message

Andrii Nakryiko Jan. 3, 2024, 10:20 p.m. UTC
Add new kind of BPF kernel object, BPF token. BPF token is meant to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from privileged process to a *trusted*
unprivileged process, all while having a good amount of control over which
privileged operations could be performed using provided BPF token.

This is achieved through mounting BPF FS instance with extra delegation
mount options, which determine what operations are delegatable, and also
constraining it to the owning user namespace (as mentioned in the
previous patch).

BPF token itself is just a derivative from BPF FS and can be created
through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
FS FD, which can be attained through open() API by opening BPF FS mount
point. Currently, BPF token "inherits" delegated command, map types,
prog type, and attach type bit sets from BPF FS as is. In the future,
having an BPF token as a separate object with its own FD, we can allow
to further restrict BPF token's allowable set of things either at the
creation time or after the fact, allowing the process to guard itself
further from unintentionally trying to load undesired kind of BPF
programs. But for now we keep things simple and just copy bit sets as is.

When BPF token is created from BPF FS mount, we take reference to the
BPF super block's owning user namespace, and then use that namespace for
checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
capabilities that are normally only checked against init userns (using
capable()), but now we check them using ns_capable() instead (if BPF
token is provided). See bpf_token_capable() for details.

Such setup means that BPF token in itself is not sufficient to grant BPF
functionality. User namespaced process has to *also* have necessary
combination of capabilities inside that user namespace. So while
previously CAP_BPF was useless when granted within user namespace, now
it gains a meaning and allows container managers and sys admins to have
a flexible control over which processes can and need to use BPF
functionality within the user namespace (i.e., container in practice).
And BPF FS delegation mount options and derived BPF tokens serve as
a per-container "flag" to grant overall ability to use bpf() (plus further
restrict on which parts of bpf() syscalls are treated as namespaced).

Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
within the BPF FS owning user namespace, rounding up the ns_capable()
story of BPF token.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf.h            |  41 +++++++
 include/uapi/linux/bpf.h       |  37 ++++++
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/inode.c             |  12 +-
 kernel/bpf/syscall.c           |  17 +++
 kernel/bpf/token.c             | 214 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  37 ++++++
 7 files changed, 354 insertions(+), 6 deletions(-)
 create mode 100644 kernel/bpf/token.c

Comments

Linus Torvalds Jan. 5, 2024, 8:25 p.m. UTC | #1
I'm still looking through the patches, but in the early parts I do
note this oddity:

On Wed, 3 Jan 2024 at 14:21, Andrii Nakryiko <andrii@kernel.org> wrote:
>
> +struct bpf_token {
> +       struct work_struct work;
> +       atomic64_t refcnt;
> +       struct user_namespace *userns;
> +       u64 allowed_cmds;
> +};

Ok, not huge, and makes sense, although I wonder if that

        atomic64_t refcnt;

should just be 'atomic_long_t' since presumably on 32-bit
architectures you can't create enough references for a 64-bit atomic
to make much sense.

Or are there references to tokens that might not use any memory?

Not a big deal, but 'atomic64_t' is very expensive on 32-bit
architectures, and doesn't seem to make much sense unless you really
specifically need 64 bits for some reason.

But regardless, this is odd:

> diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
> +
> +static void bpf_token_free(struct bpf_token *token)
> +{
> +       put_user_ns(token->userns);
> +       kvfree(token);
> +}

> +int bpf_token_create(union bpf_attr *attr)
> +{
> ....
> +       token = kvzalloc(sizeof(*token), GFP_USER);

Ok, so the kvzalloc() and kvfree() certainly line up, but why use them at all?

kvmalloc() and friends are for "use kmalloc, and fall back on vmalloc
for big allocations when that fails".

For just a structure, a plain 'kzalloc()/kfree()' pair would seem to
make much more sense.

Neither of these issues are at all important, but I mention them
because they made me go "What?" when reading through the patches.

                  Linus
Matthew Wilcox Jan. 5, 2024, 8:32 p.m. UTC | #2
On Fri, Jan 05, 2024 at 12:25:42PM -0800, Linus Torvalds wrote:
> > diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
> > +
> > +static void bpf_token_free(struct bpf_token *token)
> > +{
> > +       put_user_ns(token->userns);
> > +       kvfree(token);
> > +}
> 
> > +int bpf_token_create(union bpf_attr *attr)
> > +{
> > ....
> > +       token = kvzalloc(sizeof(*token), GFP_USER);
> 
> Ok, so the kvzalloc() and kvfree() certainly line up, but why use them at all?
> 
> kvmalloc() and friends are for "use kmalloc, and fall back on vmalloc
> for big allocations when that fails".
> 
> For just a structure, a plain 'kzalloc()/kfree()' pair would seem to
> make much more sense.

I can't tell from the description whether there are going to be a lot of
these.  If there are, it might make sense to create a slab cache for
them rather than get them from the general-purpose kmalloc caches.
Linus Torvalds Jan. 5, 2024, 8:45 p.m. UTC | #3
On Fri, 5 Jan 2024 at 12:32, Matthew Wilcox <willy@infradead.org> wrote:
>
> I can't tell from the description whether there are going to be a lot of
> these.  If there are, it might make sense to create a slab cache for
> them rather than get them from the general-purpose kmalloc caches.

I suspect it's a "count on the fingers of your hand" thing, and having
a slab cache would be more overhead than you'd ever win.

           Linus
Linus Torvalds Jan. 5, 2024, 9:45 p.m. UTC | #4
Ok, I've gone through the whole series now, and I don't find anything
objectionable.

Which may only mean that I didn't notice something, of course, but at
least there's nothing I'd consider obvious.

I keep coming back to this 03/29 patch, because it's kind of the heart
of it, and I have one more small nit, but it's also purely stylistic:

On Wed, 3 Jan 2024 at 14:21, Andrii Nakryiko <andrii@kernel.org> wrote:
>
> +bool bpf_token_capable(const struct bpf_token *token, int cap)
> +{
> +       /* BPF token allows ns_capable() level of capabilities, but only if
> +        * token's userns is *exactly* the same as current user's userns
> +        */
> +       if (token && current_user_ns() == token->userns) {
> +               if (ns_capable(token->userns, cap))
> +                       return true;
> +               if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> +                       return true;
> +       }
> +       /* otherwise fallback to capable() checks */
> +       return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> +}

This *feels* like it should be written as

    bool bpf_token_capable(const struct bpf_token *token, int cap)
    {
        struct user_namespace *ns = &init_ns;

        /* BPF token allows ns_capable() level of capabilities, but only if
         * token's userns is *exactly* the same as current user's userns
         */
        if (token && current_user_ns() == token->userns)
                ns = token->userns;
        return ns_capable(ns, cap) ||
                (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
    }

And yes, I realize that the function will end up later growing a

        security_bpf_token_capable(token, cap)

test inside that 'if (token ..)' statement, and this would change the
order of that test so that the LSM hook would now be done before the
capability checks are done, but that all still seems just more of an
argument for the simplification.

So the end result would be something like

    bool bpf_token_capable(const struct bpf_token *token, int cap)
    {
        struct user_namespace *ns = &init_ns;

        if (token && current_user_ns() == token->userns) {
                if (security_bpf_token_capable(token, cap) < 0)
                        return false;
                ns = token->userns;
        }
        return ns_capable(ns, cap) ||
                (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
    }

although I feel that with that LSM hook, maybe this all should return
the error code (zero or negative), not a bool for success?

Also, should "current_user_ns() != token->userns" perhaps be an error
condition, rather than a "fall back to init_ns" condition?

Again, none of this is a big deal. I do think you're dropping the LSM
error code on the floor, and are duplicating the "ns_capable()" vs
"capable()" logic as-is, but none of this is a deal breaker, just more
of my commentary on the patch and about the logic here.

And yeah, I don't exactly love how you say "ok, if there's a token and
it doesn't match, I'll not use it" rather than "if the token namespace
doesn't match, it's an error", but maybe there's some usability issue
here?

              Linus
Andrii Nakryiko Jan. 5, 2024, 10:05 p.m. UTC | #5
On Fri, Jan 5, 2024 at 12:26 PM Linus Torvalds
<torvalds@linuxfoundation.org> wrote:
>
> I'm still looking through the patches, but in the early parts I do
> note this oddity:
>
> On Wed, 3 Jan 2024 at 14:21, Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > +struct bpf_token {
> > +       struct work_struct work;
> > +       atomic64_t refcnt;
> > +       struct user_namespace *userns;
> > +       u64 allowed_cmds;
> > +};
>
> Ok, not huge, and makes sense, although I wonder if that
>
>         atomic64_t refcnt;
>
> should just be 'atomic_long_t' since presumably on 32-bit
> architectures you can't create enough references for a 64-bit atomic
> to make much sense.
>
> Or are there references to tokens that might not use any memory?
>
> Not a big deal, but 'atomic64_t' is very expensive on 32-bit
> architectures, and doesn't seem to make much sense unless you really
> specifically need 64 bits for some reason.

I used atomic64_t for consistency with other BPF objects (program,
etc) and not to have to worry even about hypothetical overflows.
32-bit atomic performance doesn't seem to be a big concern as a token
is passed into a pretty heavy-weight operations that create new BPF
object (map, program, BTF object), no matter how slow refcounting is
it will be lost in the noise for those operations.

>
> But regardless, this is odd:
>
> > diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
> > +
> > +static void bpf_token_free(struct bpf_token *token)
> > +{
> > +       put_user_ns(token->userns);
> > +       kvfree(token);
> > +}
>
> > +int bpf_token_create(union bpf_attr *attr)
> > +{
> > ....
> > +       token = kvzalloc(sizeof(*token), GFP_USER);
>
> Ok, so the kvzalloc() and kvfree() certainly line up, but why use them at all?

No particular reason, kzalloc/kfree should totally work, it's a small
struct anyways. I will switch.

>
> kvmalloc() and friends are for "use kmalloc, and fall back on vmalloc
> for big allocations when that fails".
>
> For just a structure, a plain 'kzalloc()/kfree()' pair would seem to
> make much more sense.
>
> Neither of these issues are at all important, but I mention them
> because they made me go "What?" when reading through the patches.
>
>                   Linus
Andrii Nakryiko Jan. 5, 2024, 10:06 p.m. UTC | #6
On Fri, Jan 5, 2024 at 12:46 PM Linus Torvalds
<torvalds@linuxfoundation.org> wrote:
>
> On Fri, 5 Jan 2024 at 12:32, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > I can't tell from the description whether there are going to be a lot of
> > these.  If there are, it might make sense to create a slab cache for
> > them rather than get them from the general-purpose kmalloc caches.
>
> I suspect it's a "count on the fingers of your hand" thing, and having
> a slab cache would be more overhead than you'd ever win.

Yes, you suspect right. It will be mostly one BPF token instance per
application, and even then only if the application is running within a
container that has BPF token set up (through BPF FS instance). So
yeah, slab cache seems like an overkill.

>
>            Linus
Andrii Nakryiko Jan. 5, 2024, 10:18 p.m. UTC | #7
On Fri, Jan 5, 2024 at 1:45 PM Linus Torvalds
<torvalds@linuxfoundation.org> wrote:
>
> Ok, I've gone through the whole series now, and I don't find anything
> objectionable.

That's great, thanks for reviewing!

>
> Which may only mean that I didn't notice something, of course, but at
> least there's nothing I'd consider obvious.
>
> I keep coming back to this 03/29 patch, because it's kind of the heart
> of it, and I have one more small nit, but it's also purely stylistic:
>
> On Wed, 3 Jan 2024 at 14:21, Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > +bool bpf_token_capable(const struct bpf_token *token, int cap)
> > +{
> > +       /* BPF token allows ns_capable() level of capabilities, but only if
> > +        * token's userns is *exactly* the same as current user's userns
> > +        */
> > +       if (token && current_user_ns() == token->userns) {
> > +               if (ns_capable(token->userns, cap))
> > +                       return true;
> > +               if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> > +                       return true;
> > +       }
> > +       /* otherwise fallback to capable() checks */
> > +       return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > +}
>
> This *feels* like it should be written as
>
>     bool bpf_token_capable(const struct bpf_token *token, int cap)
>     {
>         struct user_namespace *ns = &init_ns;
>
>         /* BPF token allows ns_capable() level of capabilities, but only if
>          * token's userns is *exactly* the same as current user's userns
>          */
>         if (token && current_user_ns() == token->userns)
>                 ns = token->userns;
>         return ns_capable(ns, cap) ||
>                 (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
>     }
>
> And yes, I realize that the function will end up later growing a
>
>         security_bpf_token_capable(token, cap)
>
> test inside that 'if (token ..)' statement, and this would change the
> order of that test so that the LSM hook would now be done before the
> capability checks are done, but that all still seems just more of an
> argument for the simplification.
>
> So the end result would be something like
>
>     bool bpf_token_capable(const struct bpf_token *token, int cap)
>     {
>         struct user_namespace *ns = &init_ns;
>
>         if (token && current_user_ns() == token->userns) {
>                 if (security_bpf_token_capable(token, cap) < 0)
>                         return false;
>                 ns = token->userns;
>         }
>         return ns_capable(ns, cap) ||
>                 (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
>     }

Yep, it makes sense to use ns_capable with init_ns. I'll change those
two patches to end up with something like what you suggested here.

>
> although I feel that with that LSM hook, maybe this all should return
> the error code (zero or negative), not a bool for success?
>
> Also, should "current_user_ns() != token->userns" perhaps be an error
> condition, rather than a "fall back to init_ns" condition?
>
> Again, none of this is a big deal. I do think you're dropping the LSM
> error code on the floor, and are duplicating the "ns_capable()" vs
> "capable()" logic as-is, but none of this is a deal breaker, just more
> of my commentary on the patch and about the logic here.
>
> And yeah, I don't exactly love how you say "ok, if there's a token and
> it doesn't match, I'll not use it" rather than "if the token namespace
> doesn't match, it's an error", but maybe there's some usability issue
> here?

Yes, usability was the primary concern. The overall idea with BPF
token is to make most BPF applications not care or even potentially
know about its existence, and mostly leave it up to administrators
and/or container managers to set up an environment with BPF token
delegation. To make that all possible, libbpf will opportunistically
try to create BPF token from BPF FS in the container (typically
/sys/fs/bpf, but it can be tuned, of course). And so if BPF token can
actually prevent, say, BPF program loading, because it didn't allow
particular program type to be loaded or whatnot, that would be a
regression of behavior relative to if BPF token was never even used.

So I consciously wanted a behavior in which BPF token can be used as a
sort of potential/additional rights, but otherwise just fallback to
current behavior based on capable(CAP_BPF) and other caps we use.

The alternative to the above would be creating a few more APIs to
proactively check if a given BPF token instance would allow whatever
operation libbpf needs to perform, and if not, not using it. Which
would be used to achieve the exact same behavior but in a more round
about way.

And the last piece of thinking was that if the user actually would
want to fail bpf() operation if the BPF token doesn't grant such
permissions, we can add a flag that would force this behavior. Some
sort of BPF_F_TOKEN_STRICT that can be optionally specified. But I
wanted to wait for an actual production use case that would want that
(I'm not aware of any right now).

>
>               Linus
Alexei Starovoitov Jan. 5, 2024, 10:27 p.m. UTC | #8
On Fri, Jan 5, 2024 at 2:06 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Jan 5, 2024 at 12:26 PM Linus Torvalds
> <torvalds@linuxfoundation.org> wrote:
> >
> > I'm still looking through the patches, but in the early parts I do
> > note this oddity:
> >
> > On Wed, 3 Jan 2024 at 14:21, Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > +struct bpf_token {
> > > +       struct work_struct work;
> > > +       atomic64_t refcnt;
> > > +       struct user_namespace *userns;
> > > +       u64 allowed_cmds;
> > > +};
> >
> > Ok, not huge, and makes sense, although I wonder if that
> >
> >         atomic64_t refcnt;
> >
> > should just be 'atomic_long_t' since presumably on 32-bit
> > architectures you can't create enough references for a 64-bit atomic
> > to make much sense.
> >
> > Or are there references to tokens that might not use any memory?
> >
> > Not a big deal, but 'atomic64_t' is very expensive on 32-bit
> > architectures, and doesn't seem to make much sense unless you really
> > specifically need 64 bits for some reason.
>
> I used atomic64_t for consistency with other BPF objects (program,
> etc) and not to have to worry even about hypothetical overflows.
> 32-bit atomic performance doesn't seem to be a big concern as a token
> is passed into a pretty heavy-weight operations that create new BPF
> object (map, program, BTF object), no matter how slow refcounting is
> it will be lost in the noise for those operations.

To add a bit more context here...

Back in 2016 Jann managed to overflow 32-bit prog/map counters:
"
On a system with >32Gbyte of physical memory,
the malicious application may overflow 32-bit bpf program refcnt.
It's also possible to overflow map refcnt on 1Tb system.
"
We mitigated that with fixed limits:
-       atomic_inc(&map->refcnt);
+       if (atomic_inc_return(&map->refcnt) > BPF_MAX_REFCNT) {
+               atomic_dec(&map->refcnt);
+               return ERR_PTR(-EBUSY);
+       }
but it created quite a lot of error handling pain throughout
the code, so eventually in 2019 we switched to atomic64_t refcnt
and never looked back.
I suspect Jann will be able to overflow 32-bit token refcnt,
so atomic64 was chosen for simplicity.
atomic_long_t might work too, but the effort to think it through
is not worth it at this point, since performance of
inc/dec doesn't matter here.

Eventually we can do a follow up and consistently update
all such counters to atomic_long_t.
Christian Brauner Jan. 8, 2024, 11:44 a.m. UTC | #9
On Wed, Jan 03, 2024 at 02:20:08PM -0800, Andrii Nakryiko wrote:
> Add new kind of BPF kernel object, BPF token. BPF token is meant to
> allow delegating privileged BPF functionality, like loading a BPF
> program or creating a BPF map, from privileged process to a *trusted*
> unprivileged process, all while having a good amount of control over which
> privileged operations could be performed using provided BPF token.
> 
> This is achieved through mounting BPF FS instance with extra delegation
> mount options, which determine what operations are delegatable, and also
> constraining it to the owning user namespace (as mentioned in the
> previous patch).
> 
> BPF token itself is just a derivative from BPF FS and can be created
> through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
> FS FD, which can be attained through open() API by opening BPF FS mount
> point. Currently, BPF token "inherits" delegated command, map types,
> prog type, and attach type bit sets from BPF FS as is. In the future,
> having an BPF token as a separate object with its own FD, we can allow
> to further restrict BPF token's allowable set of things either at the
> creation time or after the fact, allowing the process to guard itself
> further from unintentionally trying to load undesired kind of BPF
> programs. But for now we keep things simple and just copy bit sets as is.
> 
> When BPF token is created from BPF FS mount, we take reference to the
> BPF super block's owning user namespace, and then use that namespace for
> checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
> capabilities that are normally only checked against init userns (using
> capable()), but now we check them using ns_capable() instead (if BPF
> token is provided). See bpf_token_capable() for details.
> 
> Such setup means that BPF token in itself is not sufficient to grant BPF
> functionality. User namespaced process has to *also* have necessary
> combination of capabilities inside that user namespace. So while
> previously CAP_BPF was useless when granted within user namespace, now
> it gains a meaning and allows container managers and sys admins to have
> a flexible control over which processes can and need to use BPF
> functionality within the user namespace (i.e., container in practice).
> And BPF FS delegation mount options and derived BPF tokens serve as
> a per-container "flag" to grant overall ability to use bpf() (plus further
> restrict on which parts of bpf() syscalls are treated as namespaced).
> 
> Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
> within the BPF FS owning user namespace, rounding up the ns_capable()
> story of BPF token.
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---

Acked-by: Christian Brauner <brauner@kernel.org>
Christian Brauner Jan. 8, 2024, 12:01 p.m. UTC | #10
> Also, should "current_user_ns() != token->userns" perhaps be an error
> condition, rather than a "fall back to init_ns" condition?

Yes, I've pointed this out before:

"Please enforce that in order to use a token the caller must be in the
same user namespace as the token as well. IOW, we don't want to yet make
it possible to use a token created in an ancestor user namespace to load
or attach bpf programs in a descendant user namespace. Let's be as
restrictive as we can: tokens are only valid within the user namespace
they were created in."

[1] Re: [PATCH v11 bpf-next 03/17] bpf: introduce BPF token object
    https://lore.kernel.org/r/20231130-katzen-anhand-7ad530f187da@brauner

> 
> Again, none of this is a big deal. I do think you're dropping the LSM
> error code on the floor, and are duplicating the "ns_capable()" vs
> "capable()" logic as-is, but none of this is a deal breaker, just more
> of my commentary on the patch and about the logic here.
> 
> And yeah, I don't exactly love how you say "ok, if there's a token and
> it doesn't match, I'll not use it" rather than "if the token namespace
> doesn't match, it's an error", but maybe there's some usability issue
> here?
Christian Brauner Jan. 8, 2024, 12:02 p.m. UTC | #11
On Fri, Jan 05, 2024 at 02:18:40PM -0800, Andrii Nakryiko wrote:
> On Fri, Jan 5, 2024 at 1:45 PM Linus Torvalds
> <torvalds@linuxfoundation.org> wrote:
> >
> > Ok, I've gone through the whole series now, and I don't find anything
> > objectionable.
> 
> That's great, thanks for reviewing!
> 
> >
> > Which may only mean that I didn't notice something, of course, but at
> > least there's nothing I'd consider obvious.
> >
> > I keep coming back to this 03/29 patch, because it's kind of the heart
> > of it, and I have one more small nit, but it's also purely stylistic:
> >
> > On Wed, 3 Jan 2024 at 14:21, Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > +bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > +{
> > > +       /* BPF token allows ns_capable() level of capabilities, but only if
> > > +        * token's userns is *exactly* the same as current user's userns
> > > +        */
> > > +       if (token && current_user_ns() == token->userns) {
> > > +               if (ns_capable(token->userns, cap))
> > > +                       return true;
> > > +               if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> > > +                       return true;
> > > +       }
> > > +       /* otherwise fallback to capable() checks */
> > > +       return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > +}
> >
> > This *feels* like it should be written as
> >
> >     bool bpf_token_capable(const struct bpf_token *token, int cap)
> >     {
> >         struct user_namespace *ns = &init_ns;
> >
> >         /* BPF token allows ns_capable() level of capabilities, but only if
> >          * token's userns is *exactly* the same as current user's userns
> >          */
> >         if (token && current_user_ns() == token->userns)
> >                 ns = token->userns;
> >         return ns_capable(ns, cap) ||
> >                 (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> >     }
> >
> > And yes, I realize that the function will end up later growing a
> >
> >         security_bpf_token_capable(token, cap)
> >
> > test inside that 'if (token ..)' statement, and this would change the
> > order of that test so that the LSM hook would now be done before the
> > capability checks are done, but that all still seems just more of an
> > argument for the simplification.
> >
> > So the end result would be something like
> >
> >     bool bpf_token_capable(const struct bpf_token *token, int cap)
> >     {
> >         struct user_namespace *ns = &init_ns;
> >
> >         if (token && current_user_ns() == token->userns) {
> >                 if (security_bpf_token_capable(token, cap) < 0)
> >                         return false;
> >                 ns = token->userns;
> >         }
> >         return ns_capable(ns, cap) ||
> >                 (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> >     }
> 
> Yep, it makes sense to use ns_capable with init_ns. I'll change those
> two patches to end up with something like what you suggested here.
> 
> >
> > although I feel that with that LSM hook, maybe this all should return
> > the error code (zero or negative), not a bool for success?
> >
> > Also, should "current_user_ns() != token->userns" perhaps be an error
> > condition, rather than a "fall back to init_ns" condition?
> >
> > Again, none of this is a big deal. I do think you're dropping the LSM
> > error code on the floor, and are duplicating the "ns_capable()" vs
> > "capable()" logic as-is, but none of this is a deal breaker, just more
> > of my commentary on the patch and about the logic here.
> >
> > And yeah, I don't exactly love how you say "ok, if there's a token and
> > it doesn't match, I'll not use it" rather than "if the token namespace
> > doesn't match, it's an error", but maybe there's some usability issue
> > here?
> 
> Yes, usability was the primary concern. The overall idea with BPF

NAK on not restricting this to not erroring out on current_user_ns()
!= token->user_ns. I've said this multiple times before.
Paul Moore Jan. 8, 2024, 4:45 p.m. UTC | #12
On Fri, Jan 5, 2024 at 4:45 PM Linus Torvalds
<torvalds@linuxfoundation.org> wrote:
> On Wed, 3 Jan 2024 at 14:21, Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > +bool bpf_token_capable(const struct bpf_token *token, int cap)
> > +{
> > +       /* BPF token allows ns_capable() level of capabilities, but only if
> > +        * token's userns is *exactly* the same as current user's userns
> > +        */
> > +       if (token && current_user_ns() == token->userns) {
> > +               if (ns_capable(token->userns, cap))
> > +                       return true;
> > +               if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> > +                       return true;
> > +       }
> > +       /* otherwise fallback to capable() checks */
> > +       return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > +}
>
> This *feels* like it should be written as
>
>     bool bpf_token_capable(const struct bpf_token *token, int cap)
>     {
>         struct user_namespace *ns = &init_ns;
>
>         /* BPF token allows ns_capable() level of capabilities, but only if
>          * token's userns is *exactly* the same as current user's userns
>          */
>         if (token && current_user_ns() == token->userns)
>                 ns = token->userns;
>         return ns_capable(ns, cap) ||
>                 (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
>     }
>
> And yes, I realize that the function will end up later growing a
>
>         security_bpf_token_capable(token, cap)
>
> test inside that 'if (token ..)' statement, and this would change the
> order of that test so that the LSM hook would now be done before the
> capability checks are done, but that all still seems just more of an
> argument for the simplification.

I have no problem with rewriting things, my only ask is that we stick
with the idea of doing the capability checks before the LSM hook.  The
DAC-before-MAC (capability-before-LSM) pattern is one we try to stick
to most everywhere in the kernel and deviating from it here could
potentially result in some odd/unexpected behavior from a user
perspective.
Andrii Nakryiko Jan. 8, 2024, 11:58 p.m. UTC | #13
On Mon, Jan 8, 2024 at 4:02 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Fri, Jan 05, 2024 at 02:18:40PM -0800, Andrii Nakryiko wrote:
> > On Fri, Jan 5, 2024 at 1:45 PM Linus Torvalds
> > <torvalds@linuxfoundation.org> wrote:
> > >
> > > Ok, I've gone through the whole series now, and I don't find anything
> > > objectionable.
> >
> > That's great, thanks for reviewing!
> >
> > >
> > > Which may only mean that I didn't notice something, of course, but at
> > > least there's nothing I'd consider obvious.
> > >
> > > I keep coming back to this 03/29 patch, because it's kind of the heart
> > > of it, and I have one more small nit, but it's also purely stylistic:
> > >
> > > On Wed, 3 Jan 2024 at 14:21, Andrii Nakryiko <andrii@kernel.org> wrote:
> > > >
> > > > +bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > > +{
> > > > +       /* BPF token allows ns_capable() level of capabilities, but only if
> > > > +        * token's userns is *exactly* the same as current user's userns
> > > > +        */
> > > > +       if (token && current_user_ns() == token->userns) {
> > > > +               if (ns_capable(token->userns, cap))
> > > > +                       return true;
> > > > +               if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> > > > +                       return true;
> > > > +       }
> > > > +       /* otherwise fallback to capable() checks */
> > > > +       return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > > +}
> > >
> > > This *feels* like it should be written as
> > >
> > >     bool bpf_token_capable(const struct bpf_token *token, int cap)
> > >     {
> > >         struct user_namespace *ns = &init_ns;
> > >
> > >         /* BPF token allows ns_capable() level of capabilities, but only if
> > >          * token's userns is *exactly* the same as current user's userns
> > >          */
> > >         if (token && current_user_ns() == token->userns)
> > >                 ns = token->userns;
> > >         return ns_capable(ns, cap) ||
> > >                 (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > >     }
> > >
> > > And yes, I realize that the function will end up later growing a
> > >
> > >         security_bpf_token_capable(token, cap)
> > >
> > > test inside that 'if (token ..)' statement, and this would change the
> > > order of that test so that the LSM hook would now be done before the
> > > capability checks are done, but that all still seems just more of an
> > > argument for the simplification.
> > >
> > > So the end result would be something like
> > >
> > >     bool bpf_token_capable(const struct bpf_token *token, int cap)
> > >     {
> > >         struct user_namespace *ns = &init_ns;
> > >
> > >         if (token && current_user_ns() == token->userns) {
> > >                 if (security_bpf_token_capable(token, cap) < 0)
> > >                         return false;
> > >                 ns = token->userns;
> > >         }
> > >         return ns_capable(ns, cap) ||
> > >                 (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > >     }
> >
> > Yep, it makes sense to use ns_capable with init_ns. I'll change those
> > two patches to end up with something like what you suggested here.
> >
> > >
> > > although I feel that with that LSM hook, maybe this all should return
> > > the error code (zero or negative), not a bool for success?
> > >
> > > Also, should "current_user_ns() != token->userns" perhaps be an error
> > > condition, rather than a "fall back to init_ns" condition?
> > >
> > > Again, none of this is a big deal. I do think you're dropping the LSM
> > > error code on the floor, and are duplicating the "ns_capable()" vs
> > > "capable()" logic as-is, but none of this is a deal breaker, just more
> > > of my commentary on the patch and about the logic here.
> > >
> > > And yeah, I don't exactly love how you say "ok, if there's a token and
> > > it doesn't match, I'll not use it" rather than "if the token namespace
> > > doesn't match, it's an error", but maybe there's some usability issue
> > > here?
> >
> > Yes, usability was the primary concern. The overall idea with BPF
>
> NAK on not restricting this to not erroring out on current_user_ns()
> != token->user_ns. I've said this multiple times before.

I do restrict token usage to *exact* userns in which the token was
created. See bpf_token_capable()'s

if (token && current_user_ns() == token->userns) { ... }

and in bpf_token_allow_cmd():

if (!token || current_user_ns() != token->userns)
    return false;

So I followed what you asked in [1] (just like I said I will in [2]),
unless I made some stupid mistake which I cannot even see.


What we are discussing here is a different question. It's the
difference between erroring out (that is, failing whatever BPF
operation was attempted with such token, i.e., program loading or map
creation) vs ignoring the token altogether and just using
init_ns-based capable() checks. And the latter is vastly more user
friendly when considering end-to-end integration with user-space
applications and tooling. And doesn't seem to open any security holes.

  [1] https://lore.kernel.org/r/20231130-katzen-anhand-7ad530f187da@brauner
  [2] https://lore.kernel.org/all/CAEf4BzZA2or352VkAaBsr+fsWAGO1Cs_gonH7Ffm5emXGE+2Ug@mail.gmail.com/
Andrii Nakryiko Jan. 9, 2024, 12:07 a.m. UTC | #14
On Mon, Jan 8, 2024 at 8:45 AM Paul Moore <paul@paul-moore.com> wrote:
>
> On Fri, Jan 5, 2024 at 4:45 PM Linus Torvalds
> <torvalds@linuxfoundation.org> wrote:
> > On Wed, 3 Jan 2024 at 14:21, Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > +bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > +{
> > > +       /* BPF token allows ns_capable() level of capabilities, but only if
> > > +        * token's userns is *exactly* the same as current user's userns
> > > +        */
> > > +       if (token && current_user_ns() == token->userns) {
> > > +               if (ns_capable(token->userns, cap))
> > > +                       return true;
> > > +               if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> > > +                       return true;
> > > +       }
> > > +       /* otherwise fallback to capable() checks */
> > > +       return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > +}
> >
> > This *feels* like it should be written as
> >
> >     bool bpf_token_capable(const struct bpf_token *token, int cap)
> >     {
> >         struct user_namespace *ns = &init_ns;
> >
> >         /* BPF token allows ns_capable() level of capabilities, but only if
> >          * token's userns is *exactly* the same as current user's userns
> >          */
> >         if (token && current_user_ns() == token->userns)
> >                 ns = token->userns;
> >         return ns_capable(ns, cap) ||
> >                 (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> >     }
> >
> > And yes, I realize that the function will end up later growing a
> >
> >         security_bpf_token_capable(token, cap)
> >
> > test inside that 'if (token ..)' statement, and this would change the
> > order of that test so that the LSM hook would now be done before the
> > capability checks are done, but that all still seems just more of an
> > argument for the simplification.
>
> I have no problem with rewriting things, my only ask is that we stick
> with the idea of doing the capability checks before the LSM hook.  The
> DAC-before-MAC (capability-before-LSM) pattern is one we try to stick
> to most everywhere in the kernel and deviating from it here could
> potentially result in some odd/unexpected behavior from a user
> perspective.

Makes sense, Paul. With the suggested rewrite we'll get an LSM call
before we get to ns_capable() (which we avoid doing in BPF code base,
generally speaking, after someone called this out earlier). Hmm...

I guess it will be better to keep this logic as is then, I believe it
was more of a subjective stylistical nit from Linus, so it probably is
ok to keep existing code.

Alternatively we could do something like:

struct user_namespace *ns = &init_ns;

if (token && current_user_ns() == token->userns)
    ns = token->user_ns;
else
    token = NULL;

if (ns_capable(ns, cap) || (cap != CAP_SYS_ADMIN && ns_capable(ns,
CAP_SYS_ADMIN)) {
    if (token)
        return security_bpf_token_capable(token, cap) == 0;
    return true;
}
return false;

Or something along those lines? I don't particularly care (though the
latter seems a bit more ceremonious), so please let me know the
preference, if any.


>
> --
> paul-moore.com
Christian Brauner Jan. 9, 2024, 2:52 p.m. UTC | #15
On Mon, Jan 08, 2024 at 03:58:47PM -0800, Andrii Nakryiko wrote:
> On Mon, Jan 8, 2024 at 4:02 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Fri, Jan 05, 2024 at 02:18:40PM -0800, Andrii Nakryiko wrote:
> > > On Fri, Jan 5, 2024 at 1:45 PM Linus Torvalds
> > > <torvalds@linuxfoundation.org> wrote:
> > > >
> > > > Ok, I've gone through the whole series now, and I don't find anything
> > > > objectionable.
> > >
> > > That's great, thanks for reviewing!
> > >
> > > >
> > > > Which may only mean that I didn't notice something, of course, but at
> > > > least there's nothing I'd consider obvious.
> > > >
> > > > I keep coming back to this 03/29 patch, because it's kind of the heart
> > > > of it, and I have one more small nit, but it's also purely stylistic:
> > > >
> > > > On Wed, 3 Jan 2024 at 14:21, Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > >
> > > > > +bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > > > +{
> > > > > +       /* BPF token allows ns_capable() level of capabilities, but only if
> > > > > +        * token's userns is *exactly* the same as current user's userns
> > > > > +        */
> > > > > +       if (token && current_user_ns() == token->userns) {
> > > > > +               if (ns_capable(token->userns, cap))
> > > > > +                       return true;
> > > > > +               if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> > > > > +                       return true;
> > > > > +       }
> > > > > +       /* otherwise fallback to capable() checks */
> > > > > +       return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > > > +}
> > > >
> > > > This *feels* like it should be written as
> > > >
> > > >     bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > >     {
> > > >         struct user_namespace *ns = &init_ns;
> > > >
> > > >         /* BPF token allows ns_capable() level of capabilities, but only if
> > > >          * token's userns is *exactly* the same as current user's userns
> > > >          */
> > > >         if (token && current_user_ns() == token->userns)
> > > >                 ns = token->userns;
> > > >         return ns_capable(ns, cap) ||
> > > >                 (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > >     }
> > > >
> > > > And yes, I realize that the function will end up later growing a
> > > >
> > > >         security_bpf_token_capable(token, cap)
> > > >
> > > > test inside that 'if (token ..)' statement, and this would change the
> > > > order of that test so that the LSM hook would now be done before the
> > > > capability checks are done, but that all still seems just more of an
> > > > argument for the simplification.
> > > >
> > > > So the end result would be something like
> > > >
> > > >     bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > >     {
> > > >         struct user_namespace *ns = &init_ns;
> > > >
> > > >         if (token && current_user_ns() == token->userns) {
> > > >                 if (security_bpf_token_capable(token, cap) < 0)
> > > >                         return false;
> > > >                 ns = token->userns;
> > > >         }
> > > >         return ns_capable(ns, cap) ||
> > > >                 (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > >     }
> > >
> > > Yep, it makes sense to use ns_capable with init_ns. I'll change those
> > > two patches to end up with something like what you suggested here.
> > >
> > > >
> > > > although I feel that with that LSM hook, maybe this all should return
> > > > the error code (zero or negative), not a bool for success?
> > > >
> > > > Also, should "current_user_ns() != token->userns" perhaps be an error
> > > > condition, rather than a "fall back to init_ns" condition?
> > > >
> > > > Again, none of this is a big deal. I do think you're dropping the LSM
> > > > error code on the floor, and are duplicating the "ns_capable()" vs
> > > > "capable()" logic as-is, but none of this is a deal breaker, just more
> > > > of my commentary on the patch and about the logic here.
> > > >
> > > > And yeah, I don't exactly love how you say "ok, if there's a token and
> > > > it doesn't match, I'll not use it" rather than "if the token namespace
> > > > doesn't match, it's an error", but maybe there's some usability issue
> > > > here?
> > >
> > > Yes, usability was the primary concern. The overall idea with BPF
> >
> > NAK on not restricting this to not erroring out on current_user_ns()
> > != token->user_ns. I've said this multiple times before.
> 
> I do restrict token usage to *exact* userns in which the token was
> created. See bpf_token_capable()'s
> 
> if (token && current_user_ns() == token->userns) { ... }
> 
> and in bpf_token_allow_cmd():
> 
> if (!token || current_user_ns() != token->userns)
>     return false;
> 
> So I followed what you asked in [1] (just like I said I will in [2]),
> unless I made some stupid mistake which I cannot even see.
> 
> 
> What we are discussing here is a different question. It's the
> difference between erroring out (that is, failing whatever BPF
> operation was attempted with such token, i.e., program loading or map
> creation) vs ignoring the token altogether and just using
> init_ns-based capable() checks. And the latter is vastly more user

Look at this:

+bool bpf_token_capable(const struct bpf_token *token, int cap)
+{
+       /* BPF token allows ns_capable() level of capabilities, but only if
+        * token's userns is *exactly* the same as current user's userns
+        */
+       if (token && current_user_ns() == token->userns) {
+               if (ns_capable(token->userns, cap))
+                       return true;
+               if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
+                       return true;
+       }
+       /* otherwise fallback to capable() checks */
+       return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
+}

How on earth is it possible that the calling task is in a user namespace
aka current_user_ns() == token->userns while at the same time being
capable in the initial user namespace? When you enter an
unprivileged user namespace you lose all capabilities against your
ancestor user namespace and you can't reenter your ancestor user
namespace.

IOW, if current_user_ns() == token->userns and token->userns !=
init_user_ns, then current_user_ns() != init_user_ns. And therefore that
thing is essentially always false for all interesting cases, no?

Aside from that it would be semantically completely unclean. The user
has specified a token and permission checking should be based on that
token and not magically fallback to a capable check in the inital user
namespace even if that worked.

Because the only scenario where that is maybe useful is if an
unprivileged container has dropped _both_ CAP_BPF and CAP_SYS_ADMIN from
the user namespace of the container.

First of, why? What thread model do you have then? Second, if you do
stupid stuff like that then you don't get bpf in the container via bpf
tokens. Period.

Restrict the meaning and validity of a bpf token to the user namespace
and do not include escape hatches such as this. Especially not in this
initial version, please.

I'm not trying to be difficult but it's clear that the implications of
user namespaces aren't well understood here. And historicaly they are
exploit facilitators as much as exploit preventers.
Andrii Nakryiko Jan. 9, 2024, 7 p.m. UTC | #16
On Tue, Jan 9, 2024 at 6:52 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Mon, Jan 08, 2024 at 03:58:47PM -0800, Andrii Nakryiko wrote:
> > On Mon, Jan 8, 2024 at 4:02 AM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > On Fri, Jan 05, 2024 at 02:18:40PM -0800, Andrii Nakryiko wrote:
> > > > On Fri, Jan 5, 2024 at 1:45 PM Linus Torvalds
> > > > <torvalds@linuxfoundation.org> wrote:
> > > > >
> > > > > Ok, I've gone through the whole series now, and I don't find anything
> > > > > objectionable.
> > > >
> > > > That's great, thanks for reviewing!
> > > >
> > > > >
> > > > > Which may only mean that I didn't notice something, of course, but at
> > > > > least there's nothing I'd consider obvious.
> > > > >
> > > > > I keep coming back to this 03/29 patch, because it's kind of the heart
> > > > > of it, and I have one more small nit, but it's also purely stylistic:
> > > > >
> > > > > On Wed, 3 Jan 2024 at 14:21, Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > > >
> > > > > > +bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > > > > +{
> > > > > > +       /* BPF token allows ns_capable() level of capabilities, but only if
> > > > > > +        * token's userns is *exactly* the same as current user's userns
> > > > > > +        */
> > > > > > +       if (token && current_user_ns() == token->userns) {
> > > > > > +               if (ns_capable(token->userns, cap))
> > > > > > +                       return true;
> > > > > > +               if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> > > > > > +                       return true;
> > > > > > +       }
> > > > > > +       /* otherwise fallback to capable() checks */
> > > > > > +       return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > > > > +}
> > > > >
> > > > > This *feels* like it should be written as
> > > > >
> > > > >     bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > > >     {
> > > > >         struct user_namespace *ns = &init_ns;
> > > > >
> > > > >         /* BPF token allows ns_capable() level of capabilities, but only if
> > > > >          * token's userns is *exactly* the same as current user's userns
> > > > >          */
> > > > >         if (token && current_user_ns() == token->userns)
> > > > >                 ns = token->userns;
> > > > >         return ns_capable(ns, cap) ||
> > > > >                 (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > > >     }
> > > > >
> > > > > And yes, I realize that the function will end up later growing a
> > > > >
> > > > >         security_bpf_token_capable(token, cap)
> > > > >
> > > > > test inside that 'if (token ..)' statement, and this would change the
> > > > > order of that test so that the LSM hook would now be done before the
> > > > > capability checks are done, but that all still seems just more of an
> > > > > argument for the simplification.
> > > > >
> > > > > So the end result would be something like
> > > > >
> > > > >     bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > > >     {
> > > > >         struct user_namespace *ns = &init_ns;
> > > > >
> > > > >         if (token && current_user_ns() == token->userns) {
> > > > >                 if (security_bpf_token_capable(token, cap) < 0)
> > > > >                         return false;
> > > > >                 ns = token->userns;
> > > > >         }
> > > > >         return ns_capable(ns, cap) ||
> > > > >                 (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > > >     }
> > > >
> > > > Yep, it makes sense to use ns_capable with init_ns. I'll change those
> > > > two patches to end up with something like what you suggested here.
> > > >
> > > > >
> > > > > although I feel that with that LSM hook, maybe this all should return
> > > > > the error code (zero or negative), not a bool for success?
> > > > >
> > > > > Also, should "current_user_ns() != token->userns" perhaps be an error
> > > > > condition, rather than a "fall back to init_ns" condition?
> > > > >
> > > > > Again, none of this is a big deal. I do think you're dropping the LSM
> > > > > error code on the floor, and are duplicating the "ns_capable()" vs
> > > > > "capable()" logic as-is, but none of this is a deal breaker, just more
> > > > > of my commentary on the patch and about the logic here.
> > > > >
> > > > > And yeah, I don't exactly love how you say "ok, if there's a token and
> > > > > it doesn't match, I'll not use it" rather than "if the token namespace
> > > > > doesn't match, it's an error", but maybe there's some usability issue
> > > > > here?
> > > >
> > > > Yes, usability was the primary concern. The overall idea with BPF
> > >
> > > NAK on not restricting this to not erroring out on current_user_ns()
> > > != token->user_ns. I've said this multiple times before.
> >
> > I do restrict token usage to *exact* userns in which the token was
> > created. See bpf_token_capable()'s
> >
> > if (token && current_user_ns() == token->userns) { ... }
> >
> > and in bpf_token_allow_cmd():
> >
> > if (!token || current_user_ns() != token->userns)
> >     return false;
> >
> > So I followed what you asked in [1] (just like I said I will in [2]),
> > unless I made some stupid mistake which I cannot even see.
> >
> >
> > What we are discussing here is a different question. It's the
> > difference between erroring out (that is, failing whatever BPF
> > operation was attempted with such token, i.e., program loading or map
> > creation) vs ignoring the token altogether and just using
> > init_ns-based capable() checks. And the latter is vastly more user
>
> Look at this:
>
> +bool bpf_token_capable(const struct bpf_token *token, int cap)
> +{
> +       /* BPF token allows ns_capable() level of capabilities, but only if
> +        * token's userns is *exactly* the same as current user's userns
> +        */
> +       if (token && current_user_ns() == token->userns) {
> +               if (ns_capable(token->userns, cap))
> +                       return true;
> +               if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> +                       return true;
> +       }
> +       /* otherwise fallback to capable() checks */
> +       return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> +}
>
> How on earth is it possible that the calling task is in a user namespace
> aka current_user_ns() == token->userns while at the same time being
> capable in the initial user namespace? When you enter an
> unprivileged user namespace you lose all capabilities against your
> ancestor user namespace and you can't reenter your ancestor user
> namespace.
>
> IOW, if current_user_ns() == token->userns and token->userns !=
> init_user_ns, then current_user_ns() != init_user_ns. And therefore that
> thing is essentially always false for all interesting cases, no?
>

Are you saying that this would be better?

   if (token && current_user_ns() == token->userns) {
       if (ns_capable(token->userns, cap))
           return true;
       if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
           return true;
       if (token->userns != &init_user_ns)
           return false;
   }
   /* otherwise fallback to capable() checks */
   return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));


I.e., return false directly if token's userns is not initns (there
will be also LSM check before this condition later on)? Falling back
to capable() checks and letting it return false if we are not in
init_ns or don't have capabilities seemed fine to me, that's all.


> Aside from that it would be semantically completely unclean. The user
> has specified a token and permission checking should be based on that
> token and not magically fallback to a capable check in the inital user
> namespace even if that worked.

I tried to explain the higher-level integration setup in [0]. The
thing is that users most of the time won't be explicitly passing a
token, BPF library will be passing it, if /sys/fs/bpf happens to be
mounted with delegation options.

So I wanted to avoid potential regressions (unintended and avoidable
failures) from using BPF token, because it might be hard to tell if a
BPF token is "beneficial" and is granting required permissions
(especially if you take into account LSM interactions). So I
consistently treat BPF token as optional/add-on permissions, not the
replacement for capable() checks.

It's true that it's unlikely that BPF token will be set up in init_ns
(except for testing, perhaps), but is it a reason to return -EPERM
without doing the same checks that would be done if BPF token wasn't
provided?


  [0] https://lore.kernel.org/bpf/CAEf4Bzb6jnJL98SLPJB7Vjxo_O33W8HjJuAsyP3+6xigZtsTkA@mail.gmail.com/

>
> Because the only scenario where that is maybe useful is if an
> unprivileged container has dropped _both_ CAP_BPF and CAP_SYS_ADMIN from
> the user namespace of the container.
>
> First of, why? What thread model do you have then? Second, if you do
> stupid stuff like that then you don't get bpf in the container via bpf
> tokens. Period.
>
> Restrict the meaning and validity of a bpf token to the user namespace
> and do not include escape hatches such as this. Especially not in this
> initial version, please.

This decision fundamentally changes how BPF loader libraries like
libbpf will have to approach BPF token integration. It's not a small
thing and not something that will be easy to change later.

>
> I'm not trying to be difficult but it's clear that the implications of
> user namespaces aren't well understood here. And historicaly they are

I don't know why you are saying this. You haven't pointed out anything
that is actually broken in the existing implementation. Sure, you
might not be a fan of the approach, but is there anything
*technically* wrong with ignoring BPF token if it doesn't provide
necessary permissions for BPF operation and consistently using the
checks that would be performed with BPF token?

> exploit facilitators as much as exploit preventers.
Christian Brauner Jan. 10, 2024, 2:59 p.m. UTC | #17
On Tue, Jan 09, 2024 at 11:00:24AM -0800, Andrii Nakryiko wrote:
> On Tue, Jan 9, 2024 at 6:52 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Mon, Jan 08, 2024 at 03:58:47PM -0800, Andrii Nakryiko wrote:
> > > On Mon, Jan 8, 2024 at 4:02 AM Christian Brauner <brauner@kernel.org> wrote:
> > > >
> > > > On Fri, Jan 05, 2024 at 02:18:40PM -0800, Andrii Nakryiko wrote:
> > > > > On Fri, Jan 5, 2024 at 1:45 PM Linus Torvalds
> > > > > <torvalds@linuxfoundation.org> wrote:
> > > > > >
> > > > > > Ok, I've gone through the whole series now, and I don't find anything
> > > > > > objectionable.
> > > > >
> > > > > That's great, thanks for reviewing!
> > > > >
> > > > > >
> > > > > > Which may only mean that I didn't notice something, of course, but at
> > > > > > least there's nothing I'd consider obvious.
> > > > > >
> > > > > > I keep coming back to this 03/29 patch, because it's kind of the heart
> > > > > > of it, and I have one more small nit, but it's also purely stylistic:
> > > > > >
> > > > > > On Wed, 3 Jan 2024 at 14:21, Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > > > >
> > > > > > > +bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > > > > > +{
> > > > > > > +       /* BPF token allows ns_capable() level of capabilities, but only if
> > > > > > > +        * token's userns is *exactly* the same as current user's userns
> > > > > > > +        */
> > > > > > > +       if (token && current_user_ns() == token->userns) {
> > > > > > > +               if (ns_capable(token->userns, cap))
> > > > > > > +                       return true;
> > > > > > > +               if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> > > > > > > +                       return true;
> > > > > > > +       }
> > > > > > > +       /* otherwise fallback to capable() checks */
> > > > > > > +       return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > > > > > +}
> > > > > >
> > > > > > This *feels* like it should be written as
> > > > > >
> > > > > >     bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > > > >     {
> > > > > >         struct user_namespace *ns = &init_ns;
> > > > > >
> > > > > >         /* BPF token allows ns_capable() level of capabilities, but only if
> > > > > >          * token's userns is *exactly* the same as current user's userns
> > > > > >          */
> > > > > >         if (token && current_user_ns() == token->userns)
> > > > > >                 ns = token->userns;
> > > > > >         return ns_capable(ns, cap) ||
> > > > > >                 (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > > > >     }
> > > > > >
> > > > > > And yes, I realize that the function will end up later growing a
> > > > > >
> > > > > >         security_bpf_token_capable(token, cap)
> > > > > >
> > > > > > test inside that 'if (token ..)' statement, and this would change the
> > > > > > order of that test so that the LSM hook would now be done before the
> > > > > > capability checks are done, but that all still seems just more of an
> > > > > > argument for the simplification.
> > > > > >
> > > > > > So the end result would be something like
> > > > > >
> > > > > >     bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > > > >     {
> > > > > >         struct user_namespace *ns = &init_ns;
> > > > > >
> > > > > >         if (token && current_user_ns() == token->userns) {
> > > > > >                 if (security_bpf_token_capable(token, cap) < 0)
> > > > > >                         return false;
> > > > > >                 ns = token->userns;
> > > > > >         }
> > > > > >         return ns_capable(ns, cap) ||
> > > > > >                 (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > > > >     }
> > > > >
> > > > > Yep, it makes sense to use ns_capable with init_ns. I'll change those
> > > > > two patches to end up with something like what you suggested here.
> > > > >
> > > > > >
> > > > > > although I feel that with that LSM hook, maybe this all should return
> > > > > > the error code (zero or negative), not a bool for success?
> > > > > >
> > > > > > Also, should "current_user_ns() != token->userns" perhaps be an error
> > > > > > condition, rather than a "fall back to init_ns" condition?
> > > > > >
> > > > > > Again, none of this is a big deal. I do think you're dropping the LSM
> > > > > > error code on the floor, and are duplicating the "ns_capable()" vs
> > > > > > "capable()" logic as-is, but none of this is a deal breaker, just more
> > > > > > of my commentary on the patch and about the logic here.
> > > > > >
> > > > > > And yeah, I don't exactly love how you say "ok, if there's a token and
> > > > > > it doesn't match, I'll not use it" rather than "if the token namespace
> > > > > > doesn't match, it's an error", but maybe there's some usability issue
> > > > > > here?
> > > > >
> > > > > Yes, usability was the primary concern. The overall idea with BPF
> > > >
> > > > NAK on not restricting this to not erroring out on current_user_ns()
> > > > != token->user_ns. I've said this multiple times before.
> > >
> > > I do restrict token usage to *exact* userns in which the token was
> > > created. See bpf_token_capable()'s
> > >
> > > if (token && current_user_ns() == token->userns) { ... }
> > >
> > > and in bpf_token_allow_cmd():
> > >
> > > if (!token || current_user_ns() != token->userns)
> > >     return false;
> > >
> > > So I followed what you asked in [1] (just like I said I will in [2]),
> > > unless I made some stupid mistake which I cannot even see.
> > >
> > >
> > > What we are discussing here is a different question. It's the
> > > difference between erroring out (that is, failing whatever BPF
> > > operation was attempted with such token, i.e., program loading or map
> > > creation) vs ignoring the token altogether and just using
> > > init_ns-based capable() checks. And the latter is vastly more user
> >
> > Look at this:
> >
> > +bool bpf_token_capable(const struct bpf_token *token, int cap)
> > +{
> > +       /* BPF token allows ns_capable() level of capabilities, but only if
> > +        * token's userns is *exactly* the same as current user's userns
> > +        */
> > +       if (token && current_user_ns() == token->userns) {
> > +               if (ns_capable(token->userns, cap))
> > +                       return true;
> > +               if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> > +                       return true;
> > +       }
> > +       /* otherwise fallback to capable() checks */
> > +       return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > +}
> >
> > How on earth is it possible that the calling task is in a user namespace
> > aka current_user_ns() == token->userns while at the same time being
> > capable in the initial user namespace? When you enter an
> > unprivileged user namespace you lose all capabilities against your
> > ancestor user namespace and you can't reenter your ancestor user
> > namespace.
> >
> > IOW, if current_user_ns() == token->userns and token->userns !=
> > init_user_ns, then current_user_ns() != init_user_ns. And therefore that
> > thing is essentially always false for all interesting cases, no?
> >
> 
> Are you saying that this would be better?
> 
>    if (token && current_user_ns() == token->userns) {
>        if (ns_capable(token->userns, cap))
>            return true;
>        if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
>            return true;
>        if (token->userns != &init_user_ns)
>            return false;
>    }
>    /* otherwise fallback to capable() checks */
>    return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> 
> 
> I.e., return false directly if token's userns is not initns (there
> will be also LSM check before this condition later on)? Falling back
> to capable() checks and letting it return false if we are not in
> init_ns or don't have capabilities seemed fine to me, that's all.
> 
> 
> > Aside from that it would be semantically completely unclean. The user
> > has specified a token and permission checking should be based on that
> > token and not magically fallback to a capable check in the inital user
> > namespace even if that worked.
> 
> I tried to explain the higher-level integration setup in [0]. The
> thing is that users most of the time won't be explicitly passing a
> token, BPF library will be passing it, if /sys/fs/bpf happens to be
> mounted with delegation options.
> 
> So I wanted to avoid potential regressions (unintended and avoidable
> failures) from using BPF token, because it might be hard to tell if a
> BPF token is "beneficial" and is granting required permissions
> (especially if you take into account LSM interactions). So I
> consistently treat BPF token as optional/add-on permissions, not the
> replacement for capable() checks.

You can always just perform the same call again without specifying the
token.

> 
> It's true that it's unlikely that BPF token will be set up in init_ns
> (except for testing, perhaps), but is it a reason to return -EPERM
> without doing the same checks that would be done if BPF token wasn't
> provided?
> 
> 
>   [0] https://lore.kernel.org/bpf/CAEf4Bzb6jnJL98SLPJB7Vjxo_O33W8HjJuAsyP3+6xigZtsTkA@mail.gmail.com/
> 
> >
> > Because the only scenario where that is maybe useful is if an
> > unprivileged container has dropped _both_ CAP_BPF and CAP_SYS_ADMIN from
> > the user namespace of the container.
> >
> > First of, why? What thread model do you have then? Second, if you do
> > stupid stuff like that then you don't get bpf in the container via bpf
> > tokens. Period.
> >
> > Restrict the meaning and validity of a bpf token to the user namespace
> > and do not include escape hatches such as this. Especially not in this
> > initial version, please.
> 
> This decision fundamentally changes how BPF loader libraries like
> libbpf will have to approach BPF token integration. It's not a small
> thing and not something that will be easy to change later.

Why? It would be relaxing permissions, not restricting it.

> 
> >
> > I'm not trying to be difficult but it's clear that the implications of
> > user namespaces aren't well understood here. And historicaly they are
> 
> I don't know why you are saying this. You haven't pointed out anything
> that is actually broken in the existing implementation. Sure, you
> might not be a fan of the approach, but is there anything
> *technically* wrong with ignoring BPF token if it doesn't provide
> necessary permissions for BPF operation and consistently using the
> checks that would be performed with BPF token?

The current check is inconsisent. It special-cases init_user_ns. The
correct thing to do for what you're intending imho is:

bool bpf_token_capable(const struct bpf_token *token, int cap)
{
        struct user_namespace *userns = &init_user_ns;

        if (token)
                userns = token->userns;
        if (ns_capable(userns, cap))
                return true;
        return cap != CAP_SYS_ADMIN && ns_capable(userns, CAP_SYS_ADMIN))

}

Because any caller located in an ancestor user namespace of
token->user_ns will be privileged wrt to the token's userns as long as
they have that capability in their user namespace.

For example, if the caller is in the init_user_ns and permissions
for CAP_WHATEVER is checked for in token->user_ns and the caller has
CAP_WHATEVER in init_user_ns then they also have it in all
descendant user namespaces.

The original intention had been to align with what we require during
token creation meaning that once a token has been created interacting
with this token is specifically confined to caller's located in the
token's user namespace.

If that's not the case then it doesn't make sense to not allow
permission checking based on regular capability semantics. IOW, why
special case init_user_ns if you're breaking the confinement restriction
anyway.
Paul Moore Jan. 10, 2024, 7:29 p.m. UTC | #18
On Mon, Jan 8, 2024 at 7:07 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Mon, Jan 8, 2024 at 8:45 AM Paul Moore <paul@paul-moore.com> wrote:
> >
> > On Fri, Jan 5, 2024 at 4:45 PM Linus Torvalds
> > <torvalds@linuxfoundation.org> wrote:
> > > On Wed, 3 Jan 2024 at 14:21, Andrii Nakryiko <andrii@kernel.org> wrote:
> > > >
> > > > +bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > > +{
> > > > +       /* BPF token allows ns_capable() level of capabilities, but only if
> > > > +        * token's userns is *exactly* the same as current user's userns
> > > > +        */
> > > > +       if (token && current_user_ns() == token->userns) {
> > > > +               if (ns_capable(token->userns, cap))
> > > > +                       return true;
> > > > +               if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> > > > +                       return true;
> > > > +       }
> > > > +       /* otherwise fallback to capable() checks */
> > > > +       return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > > +}
> > >
> > > This *feels* like it should be written as
> > >
> > >     bool bpf_token_capable(const struct bpf_token *token, int cap)
> > >     {
> > >         struct user_namespace *ns = &init_ns;
> > >
> > >         /* BPF token allows ns_capable() level of capabilities, but only if
> > >          * token's userns is *exactly* the same as current user's userns
> > >          */
> > >         if (token && current_user_ns() == token->userns)
> > >                 ns = token->userns;
> > >         return ns_capable(ns, cap) ||
> > >                 (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > >     }
> > >
> > > And yes, I realize that the function will end up later growing a
> > >
> > >         security_bpf_token_capable(token, cap)
> > >
> > > test inside that 'if (token ..)' statement, and this would change the
> > > order of that test so that the LSM hook would now be done before the
> > > capability checks are done, but that all still seems just more of an
> > > argument for the simplification.
> >
> > I have no problem with rewriting things, my only ask is that we stick
> > with the idea of doing the capability checks before the LSM hook.  The
> > DAC-before-MAC (capability-before-LSM) pattern is one we try to stick
> > to most everywhere in the kernel and deviating from it here could
> > potentially result in some odd/unexpected behavior from a user
> > perspective.
>
> Makes sense, Paul. With the suggested rewrite we'll get an LSM call
> before we get to ns_capable() (which we avoid doing in BPF code base,
> generally speaking, after someone called this out earlier). Hmm...
>
> I guess it will be better to keep this logic as is then, I believe it
> was more of a subjective stylistical nit from Linus, so it probably is
> ok to keep existing code.

I didn't read Linus' reply as a mandate, more as a
this-would-be-nice-to-have, and considering the access control
ordering I would just stick with what you have (ignoring Christian's
concerns, I'm only commenting on the LSM related stuff here).

If Linus is *really* upset with how the code is written I suspect
we'll hear from him on that.
Andrii Nakryiko Jan. 11, 2024, 12:42 a.m. UTC | #19
On Wed, Jan 10, 2024 at 6:59 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, Jan 09, 2024 at 11:00:24AM -0800, Andrii Nakryiko wrote:
> > On Tue, Jan 9, 2024 at 6:52 AM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > On Mon, Jan 08, 2024 at 03:58:47PM -0800, Andrii Nakryiko wrote:
> > > > On Mon, Jan 8, 2024 at 4:02 AM Christian Brauner <brauner@kernel.org> wrote:
> > > > >
> > > > > On Fri, Jan 05, 2024 at 02:18:40PM -0800, Andrii Nakryiko wrote:
> > > > > > On Fri, Jan 5, 2024 at 1:45 PM Linus Torvalds
> > > > > > <torvalds@linuxfoundation.org> wrote:
> > > > > > >
> > > > > > > Ok, I've gone through the whole series now, and I don't find anything
> > > > > > > objectionable.
> > > > > >
> > > > > > That's great, thanks for reviewing!
> > > > > >
> > > > > > >
> > > > > > > Which may only mean that I didn't notice something, of course, but at
> > > > > > > least there's nothing I'd consider obvious.
> > > > > > >
> > > > > > > I keep coming back to this 03/29 patch, because it's kind of the heart
> > > > > > > of it, and I have one more small nit, but it's also purely stylistic:
> > > > > > >
> > > > > > > On Wed, 3 Jan 2024 at 14:21, Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > > > > >
> > > > > > > > +bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > > > > > > +{
> > > > > > > > +       /* BPF token allows ns_capable() level of capabilities, but only if
> > > > > > > > +        * token's userns is *exactly* the same as current user's userns
> > > > > > > > +        */
> > > > > > > > +       if (token && current_user_ns() == token->userns) {
> > > > > > > > +               if (ns_capable(token->userns, cap))
> > > > > > > > +                       return true;
> > > > > > > > +               if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> > > > > > > > +                       return true;
> > > > > > > > +       }
> > > > > > > > +       /* otherwise fallback to capable() checks */
> > > > > > > > +       return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > > > > > > +}
> > > > > > >
> > > > > > > This *feels* like it should be written as
> > > > > > >
> > > > > > >     bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > > > > >     {
> > > > > > >         struct user_namespace *ns = &init_ns;
> > > > > > >
> > > > > > >         /* BPF token allows ns_capable() level of capabilities, but only if
> > > > > > >          * token's userns is *exactly* the same as current user's userns
> > > > > > >          */
> > > > > > >         if (token && current_user_ns() == token->userns)
> > > > > > >                 ns = token->userns;
> > > > > > >         return ns_capable(ns, cap) ||
> > > > > > >                 (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > > > > >     }
> > > > > > >
> > > > > > > And yes, I realize that the function will end up later growing a
> > > > > > >
> > > > > > >         security_bpf_token_capable(token, cap)
> > > > > > >
> > > > > > > test inside that 'if (token ..)' statement, and this would change the
> > > > > > > order of that test so that the LSM hook would now be done before the
> > > > > > > capability checks are done, but that all still seems just more of an
> > > > > > > argument for the simplification.
> > > > > > >
> > > > > > > So the end result would be something like
> > > > > > >
> > > > > > >     bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > > > > >     {
> > > > > > >         struct user_namespace *ns = &init_ns;
> > > > > > >
> > > > > > >         if (token && current_user_ns() == token->userns) {
> > > > > > >                 if (security_bpf_token_capable(token, cap) < 0)
> > > > > > >                         return false;
> > > > > > >                 ns = token->userns;
> > > > > > >         }
> > > > > > >         return ns_capable(ns, cap) ||
> > > > > > >                 (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > > > > >     }
> > > > > >
> > > > > > Yep, it makes sense to use ns_capable with init_ns. I'll change those
> > > > > > two patches to end up with something like what you suggested here.
> > > > > >
> > > > > > >
> > > > > > > although I feel that with that LSM hook, maybe this all should return
> > > > > > > the error code (zero or negative), not a bool for success?
> > > > > > >
> > > > > > > Also, should "current_user_ns() != token->userns" perhaps be an error
> > > > > > > condition, rather than a "fall back to init_ns" condition?
> > > > > > >
> > > > > > > Again, none of this is a big deal. I do think you're dropping the LSM
> > > > > > > error code on the floor, and are duplicating the "ns_capable()" vs
> > > > > > > "capable()" logic as-is, but none of this is a deal breaker, just more
> > > > > > > of my commentary on the patch and about the logic here.
> > > > > > >
> > > > > > > And yeah, I don't exactly love how you say "ok, if there's a token and
> > > > > > > it doesn't match, I'll not use it" rather than "if the token namespace
> > > > > > > doesn't match, it's an error", but maybe there's some usability issue
> > > > > > > here?
> > > > > >
> > > > > > Yes, usability was the primary concern. The overall idea with BPF
> > > > >
> > > > > NAK on not restricting this to not erroring out on current_user_ns()
> > > > > != token->user_ns. I've said this multiple times before.
> > > >
> > > > I do restrict token usage to *exact* userns in which the token was
> > > > created. See bpf_token_capable()'s
> > > >
> > > > if (token && current_user_ns() == token->userns) { ... }
> > > >
> > > > and in bpf_token_allow_cmd():
> > > >
> > > > if (!token || current_user_ns() != token->userns)
> > > >     return false;
> > > >
> > > > So I followed what you asked in [1] (just like I said I will in [2]),
> > > > unless I made some stupid mistake which I cannot even see.
> > > >
> > > >
> > > > What we are discussing here is a different question. It's the
> > > > difference between erroring out (that is, failing whatever BPF
> > > > operation was attempted with such token, i.e., program loading or map
> > > > creation) vs ignoring the token altogether and just using
> > > > init_ns-based capable() checks. And the latter is vastly more user
> > >
> > > Look at this:
> > >
> > > +bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > +{
> > > +       /* BPF token allows ns_capable() level of capabilities, but only if
> > > +        * token's userns is *exactly* the same as current user's userns
> > > +        */
> > > +       if (token && current_user_ns() == token->userns) {
> > > +               if (ns_capable(token->userns, cap))
> > > +                       return true;
> > > +               if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> > > +                       return true;
> > > +       }
> > > +       /* otherwise fallback to capable() checks */
> > > +       return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> > > +}
> > >
> > > How on earth is it possible that the calling task is in a user namespace
> > > aka current_user_ns() == token->userns while at the same time being
> > > capable in the initial user namespace? When you enter an
> > > unprivileged user namespace you lose all capabilities against your
> > > ancestor user namespace and you can't reenter your ancestor user
> > > namespace.
> > >
> > > IOW, if current_user_ns() == token->userns and token->userns !=
> > > init_user_ns, then current_user_ns() != init_user_ns. And therefore that
> > > thing is essentially always false for all interesting cases, no?
> > >
> >
> > Are you saying that this would be better?
> >
> >    if (token && current_user_ns() == token->userns) {
> >        if (ns_capable(token->userns, cap))
> >            return true;
> >        if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> >            return true;
> >        if (token->userns != &init_user_ns)
> >            return false;
> >    }
> >    /* otherwise fallback to capable() checks */
> >    return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> >
> >
> > I.e., return false directly if token's userns is not initns (there
> > will be also LSM check before this condition later on)? Falling back
> > to capable() checks and letting it return false if we are not in
> > init_ns or don't have capabilities seemed fine to me, that's all.
> >
> >
> > > Aside from that it would be semantically completely unclean. The user
> > > has specified a token and permission checking should be based on that
> > > token and not magically fallback to a capable check in the inital user
> > > namespace even if that worked.
> >
> > I tried to explain the higher-level integration setup in [0]. The
> > thing is that users most of the time won't be explicitly passing a
> > token, BPF library will be passing it, if /sys/fs/bpf happens to be
> > mounted with delegation options.
> >
> > So I wanted to avoid potential regressions (unintended and avoidable
> > failures) from using BPF token, because it might be hard to tell if a
> > BPF token is "beneficial" and is granting required permissions
> > (especially if you take into account LSM interactions). So I
> > consistently treat BPF token as optional/add-on permissions, not the
> > replacement for capable() checks.
>
> You can always just perform the same call again without specifying the
> token.

This has a bunch of problematic implications.

Retrying on any EPERM leads to inefficiency and potential confusion
for users. EPERM can be returned somewhere deeply in the verifier
after spending tons of memory and CPU doing verification. So it's a
waste to try with a token and then try without. And also, retrying
without a token can change specific failure reasons. E.g., for a BPF
program with token we can get deep enough into the verification
process and the verifier will provide log with details about what the
program is doing wrong (allowing the user to fix it relatively
easily), while without token we can bail out much earlier but with not
details what's wrong.

Similarly for other operations (map, BTF), token can provide log
details, etc, but retrying without token will strip users of these
helpful details. That's just to say that dropping a token
automatically doesn't necessarily provide the same user experience
compared to if the token is ignored, if it's not effective (besides
the performance and resource waste implications above).

We have a similar precedent with optional BTF. Libbpf will silently
retry map creation/program loading without BTF automatically. And we
had enough pain with this and invested a lot of work to prevent the
need for retry. We preventively sanitize or drop BTF, etc. For similar
reasons as above (performance and user experience with debugging).

So in short, it's a significant regression in usability and user
experience if the token isn't treated as an add-on permissions,
forcing (otherwise avoidable) complications into user-space libraries
and applications.

>
> >
> > It's true that it's unlikely that BPF token will be set up in init_ns
> > (except for testing, perhaps), but is it a reason to return -EPERM
> > without doing the same checks that would be done if BPF token wasn't
> > provided?
> >
> >
> >   [0] https://lore.kernel.org/bpf/CAEf4Bzb6jnJL98SLPJB7Vjxo_O33W8HjJuAsyP3+6xigZtsTkA@mail.gmail.com/
> >
> > >
> > > Because the only scenario where that is maybe useful is if an
> > > unprivileged container has dropped _both_ CAP_BPF and CAP_SYS_ADMIN from
> > > the user namespace of the container.
> > >
> > > First of, why? What thread model do you have then? Second, if you do
> > > stupid stuff like that then you don't get bpf in the container via bpf
> > > tokens. Period.
> > >
> > > Restrict the meaning and validity of a bpf token to the user namespace
> > > and do not include escape hatches such as this. Especially not in this
> > > initial version, please.
> >
> > This decision fundamentally changes how BPF loader libraries like
> > libbpf will have to approach BPF token integration. It's not a small
> > thing and not something that will be easy to change later.
>
> Why? It would be relaxing permissions, not restricting it.

See above, I tried to highlight some implications, though I realize
it's a bit hard to go over these intricate libbpf implications in a
succinct email without going into excessive details about how BPF
development and debugging is done nowadays with libbpf and
libbpf-based tooling.

>
> >
> > >
> > > I'm not trying to be difficult but it's clear that the implications of
> > > user namespaces aren't well understood here. And historicaly they are
> >
> > I don't know why you are saying this. You haven't pointed out anything
> > that is actually broken in the existing implementation. Sure, you
> > might not be a fan of the approach, but is there anything
> > *technically* wrong with ignoring BPF token if it doesn't provide
> > necessary permissions for BPF operation and consistently using the
> > checks that would be performed with BPF token?
>
> The current check is inconsisent. It special-cases init_user_ns. The
> correct thing to do for what you're intending imho is:
>
> bool bpf_token_capable(const struct bpf_token *token, int cap)
> {
>         struct user_namespace *userns = &init_user_ns;
>
>         if (token)
>                 userns = token->userns;
>         if (ns_capable(userns, cap))
>                 return true;
>         return cap != CAP_SYS_ADMIN && ns_capable(userns, CAP_SYS_ADMIN))
>
> }

Unfortunately the above becomes significantly more hairy when LSM
(security_bpf_token_capable) gets involved, while preserving the rule
"if token doesn't give rights, fall back to init userns checks".

I'm happy to accommodate any implementation of bpf_token_capable() as
long as it behaves as discussed above and also satisfies Paul's
requirement that capability checks should happen before LSM checks.

>
> Because any caller located in an ancestor user namespace of
> token->user_ns will be privileged wrt to the token's userns as long as
> they have that capability in their user namespace.

And with `current_user_ns() == token->userns` check we won't be using
token->userns while the caller is in ancestor user namespace, we'll
use capable() check, which will succeed only in init user_ns, assuming
corresponding CAP_xxx is actually set.

>
> For example, if the caller is in the init_user_ns and permissions
> for CAP_WHATEVER is checked for in token->user_ns and the caller has
> CAP_WHATEVER in init_user_ns then they also have it in all
> descendant user namespaces.

Right, so if they didn't use a token they would still pass
capable(CAP_WHATEVER), right?

>
> The original intention had been to align with what we require during
> token creation meaning that once a token has been created interacting
> with this token is specifically confined to caller's located in the
> token's user namespace.
>
> If that's not the case then it doesn't make sense to not allow
> permission checking based on regular capability semantics. IOW, why
> special case init_user_ns if you're breaking the confinement restriction
> anyway.

I'm sorry, perhaps I'm dense, but with `current_user_ns() ==
token->userns` check I think we do fulfill the intention to not allow
using a token in a userns different from the one in which it was
created. If that condition isn't satisfied, the token is immediately
ignored. So you can't use a token from another userns for anything,
it's just not there, effectively.

And as I tried to explain above, I do think that ignoring the token
instead of erroring out early is what we want to provide good
user-space ecosystem integration of BPF token.
Christian Brauner Jan. 11, 2024, 10:38 a.m. UTC | #20
> > The current check is inconsisent. It special-cases init_user_ns. The
> > correct thing to do for what you're intending imho is:
> >
> > bool bpf_token_capable(const struct bpf_token *token, int cap)
> > {
> >         struct user_namespace *userns = &init_user_ns;
> >
> >         if (token)
> >                 userns = token->userns;
> >         if (ns_capable(userns, cap))
> >                 return true;
> >         return cap != CAP_SYS_ADMIN && ns_capable(userns, CAP_SYS_ADMIN))
> >
> > }
> 
> Unfortunately the above becomes significantly more hairy when LSM
> (security_bpf_token_capable) gets involved, while preserving the rule
> "if token doesn't give rights, fall back to init userns checks".

Why? Please explain your reasoning in detail.

> 
> I'm happy to accommodate any implementation of bpf_token_capable() as
> long as it behaves as discussed above and also satisfies Paul's
> requirement that capability checks should happen before LSM checks.
> 
> >
> > Because any caller located in an ancestor user namespace of
> > token->user_ns will be privileged wrt to the token's userns as long as
> > they have that capability in their user namespace.
> 
> And with `current_user_ns() == token->userns` check we won't be using
> token->userns while the caller is in ancestor user namespace, we'll
> use capable() check, which will succeed only in init user_ns, assuming
> corresponding CAP_xxx is actually set.

Why? This isn't how any of our ns_capable() logic works.

This basically argues that anyone in an ancestor user namespace is not
allowed to operate on any of their descendant child namespaces unless
they are in the init_user_ns.

But that's nonsense as I'm trying to tell you. Any process in an
ancestor user namespace that is privileged over the child namespace can
just setns() into it and then pass that bpf_token_capable() check by
supplying the token.

At this point just do it properly and allow callers that are privileged
in the token user namespace to load bpf programs. It also means you get
user namespace nesting done properly.

> 
> >
> > For example, if the caller is in the init_user_ns and permissions
> > for CAP_WHATEVER is checked for in token->user_ns and the caller has
> > CAP_WHATEVER in init_user_ns then they also have it in all
> > descendant user namespaces.
> 
> Right, so if they didn't use a token they would still pass
> capable(CAP_WHATEVER), right?

Yes, I'm trying to accomodate your request but making it work
consistently.

> 
> >
> > The original intention had been to align with what we require during
> > token creation meaning that once a token has been created interacting
> > with this token is specifically confined to caller's located in the
> > token's user namespace.
> >
> > If that's not the case then it doesn't make sense to not allow
> > permission checking based on regular capability semantics. IOW, why
> > special case init_user_ns if you're breaking the confinement restriction
> > anyway.
> 
> I'm sorry, perhaps I'm dense, but with `current_user_ns() ==
> token->userns` check I think we do fulfill the intention to not allow
> using a token in a userns different from the one in which it was
> created. If that condition isn't satisfied, the token is immediately

My request originally was about never being able to interact with a
token outside of that userns. This is different as you provide an escape
hatch to init_user_ns. But if you need that and ignore the token then
please do it properly. That's what I'm trying to tell you. See below.

> ignored. So you can't use a token from another userns for anything,
> it's just not there, effectively.
> 
> And as I tried to explain above, I do think that ignoring the token
> instead of erroring out early is what we want to provide good
> user-space ecosystem integration of BPF token.

There is no erroring out early in. It's:

(1) Has a token been provided and is the caller capable wrt to the
    namespace of the token? Any caller in an ancestor user namespace
    that has the capability in that user namespace is capable wrt to
    that token. That __includes__ a callers in the init_user_ns. IOW,
    you don't need to fallback to any special checking for init_user_ns.
    It is literally covered in the if (token) branch with the added
    consistency that a process in an ancestor user namespace is
    privileged wrt to that token as well.

(2) No token has been provided. Then do what we always did and perform
    the capability checks based on the initial user namespace.

The only thing that you then still need is add that token_capable() hook
in there:

bool bpf_token_capable(const struct bpf_token *token, int cap)
{
	bool has_cap;
        struct user_namespace *userns = &init_user_ns;

        if (token)
                userns = token->userns;
        if (ns_capable(userns, cap))
                return true;
        if (cap != CAP_SYS_ADMIN && ns_capable(userns, CAP_SYS_ADMIN))
		return token ? security_bpf_token_capable(token, cap) == 0 : true;
	return false;
}

Or write it however you like. I think this is way more consistent and
gives you a more flexible permission model.
Andrii Nakryiko Jan. 11, 2024, 5:41 p.m. UTC | #21
On Thu, Jan 11, 2024 at 2:38 AM Christian Brauner <brauner@kernel.org> wrote:
>
> > > The current check is inconsisent. It special-cases init_user_ns. The
> > > correct thing to do for what you're intending imho is:
> > >
> > > bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > {
> > >         struct user_namespace *userns = &init_user_ns;
> > >
> > >         if (token)
> > >                 userns = token->userns;
> > >         if (ns_capable(userns, cap))
> > >                 return true;
> > >         return cap != CAP_SYS_ADMIN && ns_capable(userns, CAP_SYS_ADMIN))
> > >
> > > }
> >
> > Unfortunately the above becomes significantly more hairy when LSM
> > (security_bpf_token_capable) gets involved, while preserving the rule
> > "if token doesn't give rights, fall back to init userns checks".
>
> Why? Please explain your reasoning in detail.

Why which part? About LSM interaction making this much hairier? Then see below.

But if your "why?" is about "pretend no token, if token doesn't give
rights", then that's what I tried to explain in my last email(s). It
significantly alters (for the worse) user-space integration story
(providing a token can be a regression, so now it's not safe to
opportunistically try to create and use BPF token; on the other hand,
automatically retrying inside libbpf makes for confusing user
experience and inefficiencies). Please let me know which parts are not
clear.

>
> >
> > I'm happy to accommodate any implementation of bpf_token_capable() as
> > long as it behaves as discussed above and also satisfies Paul's
> > requirement that capability checks should happen before LSM checks.
> >
> > >
> > > Because any caller located in an ancestor user namespace of
> > > token->user_ns will be privileged wrt to the token's userns as long as
> > > they have that capability in their user namespace.
> >
> > And with `current_user_ns() == token->userns` check we won't be using
> > token->userns while the caller is in ancestor user namespace, we'll
> > use capable() check, which will succeed only in init user_ns, assuming
> > corresponding CAP_xxx is actually set.
>
> Why? This isn't how any of our ns_capable() logic works.
>
> This basically argues that anyone in an ancestor user namespace is not
> allowed to operate on any of their descendant child namespaces unless
> they are in the init_user_ns.
>
> But that's nonsense as I'm trying to tell you. Any process in an
> ancestor user namespace that is privileged over the child namespace can
> just setns() into it and then pass that bpf_token_capable() check by
> supplying the token.
>
> At this point just do it properly and allow callers that are privileged
> in the token user namespace to load bpf programs. It also means you get
> user namespace nesting done properly.

Ok, I see. This `current_user_ns() == token->userns` check prevents
this part of cap_capable() to ever be exercised:

 if ((ns->parent == cred->user_ns) && uid_eq(ns->owner, cred->euid))
    return 0;

Got it. I'm all for not adding any unnecessary restrictions.

>
> >
> > >
> > > For example, if the caller is in the init_user_ns and permissions
> > > for CAP_WHATEVER is checked for in token->user_ns and the caller has
> > > CAP_WHATEVER in init_user_ns then they also have it in all
> > > descendant user namespaces.
> >
> > Right, so if they didn't use a token they would still pass
> > capable(CAP_WHATEVER), right?
>
> Yes, I'm trying to accomodate your request but making it work
> consistently.
>
> >
> > >
> > > The original intention had been to align with what we require during
> > > token creation meaning that once a token has been created interacting
> > > with this token is specifically confined to caller's located in the
> > > token's user namespace.
> > >
> > > If that's not the case then it doesn't make sense to not allow
> > > permission checking based on regular capability semantics. IOW, why
> > > special case init_user_ns if you're breaking the confinement restriction
> > > anyway.
> >
> > I'm sorry, perhaps I'm dense, but with `current_user_ns() ==
> > token->userns` check I think we do fulfill the intention to not allow
> > using a token in a userns different from the one in which it was
> > created. If that condition isn't satisfied, the token is immediately
>
> My request originally was about never being able to interact with a
> token outside of that userns. This is different as you provide an escape
> hatch to init_user_ns. But if you need that and ignore the token then
> please do it properly. That's what I'm trying to tell you. See below.

Yes, I do need that. Thanks for providing the full code implementation
(including LSM), it's much easier this way to converge. Let's see
below.

>
> > ignored. So you can't use a token from another userns for anything,
> > it's just not there, effectively.
> >
> > And as I tried to explain above, I do think that ignoring the token
> > instead of erroring out early is what we want to provide good
> > user-space ecosystem integration of BPF token.
>
> There is no erroring out early in. It's:
>
> (1) Has a token been provided and is the caller capable wrt to the
>     namespace of the token? Any caller in an ancestor user namespace
>     that has the capability in that user namespace is capable wrt to
>     that token. That __includes__ a callers in the init_user_ns. IOW,
>     you don't need to fallback to any special checking for init_user_ns.
>     It is literally covered in the if (token) branch with the added
>     consistency that a process in an ancestor user namespace is
>     privileged wrt to that token as well.
>
> (2) No token has been provided. Then do what we always did and perform
>     the capability checks based on the initial user namespace.
>
> The only thing that you then still need is add that token_capable() hook
> in there:
>
> bool bpf_token_capable(const struct bpf_token *token, int cap)
> {
>         bool has_cap;
>         struct user_namespace *userns = &init_user_ns;
>
>         if (token)
>                 userns = token->userns;
>         if (ns_capable(userns, cap))

Here, we still need to check security_bpf_token_capable(token, cap)
result (and only if token != NULL). And if LSM returns < 0, then drop
the token and do the original init userns check.

And I just realized that my original implementation has the same
problem. In my current implementation if we have a token we will
terminate at LSM call, regardless if LSM allows or disallows the
token. But that's inconsistent behavior and shouldn't be like that.

I will add new tests that validate LSM interactions in the next revision.

>                 return true;
>         if (cap != CAP_SYS_ADMIN && ns_capable(userns, CAP_SYS_ADMIN))
>                 return token ? security_bpf_token_capable(token, cap) == 0 : true;

here as well, even if we have a token which passes ns_capable() check,
but LSM rejects this token, we still need to forget about the token
and do capable() checks in init userns.

>         return false;
> }
>
> Or write it however you like. I think this is way more consistent and
> gives you a more flexible permission model.

Yes, I like it, thanks. Taking into account fixed LSM interactions,
here's what I came up with. Yell if you can spot anything wrong (or
just hate the style). I did have a version without extra function,
just setting the token to NULL and "goto again" approach, but I think
it's way less readable and harder to follow. So this is my version
right now:

static bool bpf_ns_capable(struct user_namespace *ns, int cap)
{
        return ns_capable(ns, cap) || (cap != CAP_SYS_ADMIN &&
ns_capable(ns, CAP_SYS_ADMIN));
}

static bool token_capable(const struct bpf_token *token, int cap)
{
        struct user_namespace *userns;

        userns = token ? token->userns : &init_user_ns;
        if (!bpf_ns_capable(userns, cap))
                return false;
        if (token && security_bpf_token_capable(token, cap) < 0)
                return false;
        return true;
}

bool bpf_token_capable(const struct bpf_token *token, int cap)
{
        /* BPF token allows ns_capable() level of capabilities, but if it
         * doesn't grant required capabilities, ignore token and fallback to
         * init userns-based checks
         */
        if (token && token_capable(token, cap))
                return true;
        return token_capable(NULL, cap);
}
Christian Brauner Jan. 12, 2024, 7:58 a.m. UTC | #22
On Thu, Jan 11, 2024 at 09:41:25AM -0800, Andrii Nakryiko wrote:
> On Thu, Jan 11, 2024 at 2:38 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > > > The current check is inconsisent. It special-cases init_user_ns. The
> > > > correct thing to do for what you're intending imho is:
> > > >
> > > > bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > > {
> > > >         struct user_namespace *userns = &init_user_ns;
> > > >
> > > >         if (token)
> > > >                 userns = token->userns;
> > > >         if (ns_capable(userns, cap))
> > > >                 return true;
> > > >         return cap != CAP_SYS_ADMIN && ns_capable(userns, CAP_SYS_ADMIN))
> > > >
> > > > }
> > >
> > > Unfortunately the above becomes significantly more hairy when LSM
> > > (security_bpf_token_capable) gets involved, while preserving the rule
> > > "if token doesn't give rights, fall back to init userns checks".
> >
> > Why? Please explain your reasoning in detail.
> 
> Why which part? About LSM interaction making this much hairier? Then see below.
> 
> But if your "why?" is about "pretend no token, if token doesn't give
> rights", then that's what I tried to explain in my last email(s). It
> significantly alters (for the worse) user-space integration story
> (providing a token can be a regression, so now it's not safe to
> opportunistically try to create and use BPF token; on the other hand,
> automatically retrying inside libbpf makes for confusing user
> experience and inefficiencies). Please let me know which parts are not
> clear.
> 
> >
> > >
> > > I'm happy to accommodate any implementation of bpf_token_capable() as
> > > long as it behaves as discussed above and also satisfies Paul's
> > > requirement that capability checks should happen before LSM checks.
> > >
> > > >
> > > > Because any caller located in an ancestor user namespace of
> > > > token->user_ns will be privileged wrt to the token's userns as long as
> > > > they have that capability in their user namespace.
> > >
> > > And with `current_user_ns() == token->userns` check we won't be using
> > > token->userns while the caller is in ancestor user namespace, we'll
> > > use capable() check, which will succeed only in init user_ns, assuming
> > > corresponding CAP_xxx is actually set.
> >
> > Why? This isn't how any of our ns_capable() logic works.
> >
> > This basically argues that anyone in an ancestor user namespace is not
> > allowed to operate on any of their descendant child namespaces unless
> > they are in the init_user_ns.
> >
> > But that's nonsense as I'm trying to tell you. Any process in an
> > ancestor user namespace that is privileged over the child namespace can
> > just setns() into it and then pass that bpf_token_capable() check by
> > supplying the token.
> >
> > At this point just do it properly and allow callers that are privileged
> > in the token user namespace to load bpf programs. It also means you get
> > user namespace nesting done properly.
> 
> Ok, I see. This `current_user_ns() == token->userns` check prevents
> this part of cap_capable() to ever be exercised:
> 
>  if ((ns->parent == cred->user_ns) && uid_eq(ns->owner, cred->euid))
>     return 0;
> 
> Got it. I'm all for not adding any unnecessary restrictions.
> 
> >
> > >
> > > >
> > > > For example, if the caller is in the init_user_ns and permissions
> > > > for CAP_WHATEVER is checked for in token->user_ns and the caller has
> > > > CAP_WHATEVER in init_user_ns then they also have it in all
> > > > descendant user namespaces.
> > >
> > > Right, so if they didn't use a token they would still pass
> > > capable(CAP_WHATEVER), right?
> >
> > Yes, I'm trying to accomodate your request but making it work
> > consistently.
> >
> > >
> > > >
> > > > The original intention had been to align with what we require during
> > > > token creation meaning that once a token has been created interacting
> > > > with this token is specifically confined to caller's located in the
> > > > token's user namespace.
> > > >
> > > > If that's not the case then it doesn't make sense to not allow
> > > > permission checking based on regular capability semantics. IOW, why
> > > > special case init_user_ns if you're breaking the confinement restriction
> > > > anyway.
> > >
> > > I'm sorry, perhaps I'm dense, but with `current_user_ns() ==
> > > token->userns` check I think we do fulfill the intention to not allow
> > > using a token in a userns different from the one in which it was
> > > created. If that condition isn't satisfied, the token is immediately
> >
> > My request originally was about never being able to interact with a
> > token outside of that userns. This is different as you provide an escape
> > hatch to init_user_ns. But if you need that and ignore the token then
> > please do it properly. That's what I'm trying to tell you. See below.
> 
> Yes, I do need that. Thanks for providing the full code implementation
> (including LSM), it's much easier this way to converge. Let's see
> below.
> 
> >
> > > ignored. So you can't use a token from another userns for anything,
> > > it's just not there, effectively.
> > >
> > > And as I tried to explain above, I do think that ignoring the token
> > > instead of erroring out early is what we want to provide good
> > > user-space ecosystem integration of BPF token.
> >
> > There is no erroring out early in. It's:
> >
> > (1) Has a token been provided and is the caller capable wrt to the
> >     namespace of the token? Any caller in an ancestor user namespace
> >     that has the capability in that user namespace is capable wrt to
> >     that token. That __includes__ a callers in the init_user_ns. IOW,
> >     you don't need to fallback to any special checking for init_user_ns.
> >     It is literally covered in the if (token) branch with the added
> >     consistency that a process in an ancestor user namespace is
> >     privileged wrt to that token as well.
> >
> > (2) No token has been provided. Then do what we always did and perform
> >     the capability checks based on the initial user namespace.
> >
> > The only thing that you then still need is add that token_capable() hook
> > in there:
> >
> > bool bpf_token_capable(const struct bpf_token *token, int cap)
> > {
> >         bool has_cap;
> >         struct user_namespace *userns = &init_user_ns;
> >
> >         if (token)
> >                 userns = token->userns;
> >         if (ns_capable(userns, cap))
> 
> Here, we still need to check security_bpf_token_capable(token, cap)
> result (and only if token != NULL). And if LSM returns < 0, then drop
> the token and do the original init userns check.
> 
> And I just realized that my original implementation has the same
> problem. In my current implementation if we have a token we will
> terminate at LSM call, regardless if LSM allows or disallows the
> token. But that's inconsistent behavior and shouldn't be like that.
> 
> I will add new tests that validate LSM interactions in the next revision.
> 
> >                 return true;
> >         if (cap != CAP_SYS_ADMIN && ns_capable(userns, CAP_SYS_ADMIN))
> >                 return token ? security_bpf_token_capable(token, cap) == 0 : true;
> 
> here as well, even if we have a token which passes ns_capable() check,
> but LSM rejects this token, we still need to forget about the token
> and do capable() checks in init userns.
> 
> >         return false;
> > }
> >
> > Or write it however you like. I think this is way more consistent and
> > gives you a more flexible permission model.
> 
> Yes, I like it, thanks. Taking into account fixed LSM interactions,
> here's what I came up with. Yell if you can spot anything wrong (or
> just hate the style). I did have a version without extra function,
> just setting the token to NULL and "goto again" approach, but I think
> it's way less readable and harder to follow. So this is my version
> right now:
> 
> static bool bpf_ns_capable(struct user_namespace *ns, int cap)
> {
>         return ns_capable(ns, cap) || (cap != CAP_SYS_ADMIN &&
> ns_capable(ns, CAP_SYS_ADMIN));
> }
> 
> static bool token_capable(const struct bpf_token *token, int cap)
> {
>         struct user_namespace *userns;
> 
>         userns = token ? token->userns : &init_user_ns;
>         if (!bpf_ns_capable(userns, cap))
>                 return false;
>         if (token && security_bpf_token_capable(token, cap) < 0)
>                 return false;
>         return true;
> }
> 
> bool bpf_token_capable(const struct bpf_token *token, int cap)
> {
>         /* BPF token allows ns_capable() level of capabilities, but if it
>          * doesn't grant required capabilities, ignore token and fallback to
>          * init userns-based checks
>          */
>         if (token && token_capable(token, cap))
>                 return true;
>         return token_capable(NULL, cap);
> }

My point is that the capable logic will walk upwards the user namespace
hierarchy from the token->userns until the user namespace of the caller
and terminate when it reached the init_user_ns.

A caller is located in some namespace at the point where they call this
function. They provided a token. The caller isn't capable in the
namespace of the token so the function falls back to init_user_ns. Two
interesting cases:

(1) The caller wasn't in an ancestor userns of the token. If that's the
    case then it follows that the caller also wasn't in the init_user_ns
    because the init_user_ns is a descendant of all other user
    namespaces. So falling back will fail.

(2) The caller was in the same or an ancestor user namespace of the
    token but didn't have the capability in that user namespace:
    
     (i) They were in a non-init_user_ns. Therefore they can't be
         privileged in init_user_ns.
    (ii) They were in init_user_ns. Therefore, they lacked privileges in
         the init_user_ns.
    
In both cases your fallback will do nothing iiuc.
Andrii Nakryiko Jan. 12, 2024, 6:32 p.m. UTC | #23
On Thu, Jan 11, 2024 at 11:58 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Thu, Jan 11, 2024 at 09:41:25AM -0800, Andrii Nakryiko wrote:
> > On Thu, Jan 11, 2024 at 2:38 AM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > > > The current check is inconsisent. It special-cases init_user_ns. The
> > > > > correct thing to do for what you're intending imho is:
> > > > >
> > > > > bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > > > {
> > > > >         struct user_namespace *userns = &init_user_ns;
> > > > >
> > > > >         if (token)
> > > > >                 userns = token->userns;
> > > > >         if (ns_capable(userns, cap))
> > > > >                 return true;
> > > > >         return cap != CAP_SYS_ADMIN && ns_capable(userns, CAP_SYS_ADMIN))
> > > > >
> > > > > }
> > > >
> > > > Unfortunately the above becomes significantly more hairy when LSM
> > > > (security_bpf_token_capable) gets involved, while preserving the rule
> > > > "if token doesn't give rights, fall back to init userns checks".
> > >
> > > Why? Please explain your reasoning in detail.
> >
> > Why which part? About LSM interaction making this much hairier? Then see below.
> >
> > But if your "why?" is about "pretend no token, if token doesn't give
> > rights", then that's what I tried to explain in my last email(s). It
> > significantly alters (for the worse) user-space integration story
> > (providing a token can be a regression, so now it's not safe to
> > opportunistically try to create and use BPF token; on the other hand,
> > automatically retrying inside libbpf makes for confusing user
> > experience and inefficiencies). Please let me know which parts are not
> > clear.
> >
> > >
> > > >
> > > > I'm happy to accommodate any implementation of bpf_token_capable() as
> > > > long as it behaves as discussed above and also satisfies Paul's
> > > > requirement that capability checks should happen before LSM checks.
> > > >
> > > > >
> > > > > Because any caller located in an ancestor user namespace of
> > > > > token->user_ns will be privileged wrt to the token's userns as long as
> > > > > they have that capability in their user namespace.
> > > >
> > > > And with `current_user_ns() == token->userns` check we won't be using
> > > > token->userns while the caller is in ancestor user namespace, we'll
> > > > use capable() check, which will succeed only in init user_ns, assuming
> > > > corresponding CAP_xxx is actually set.
> > >
> > > Why? This isn't how any of our ns_capable() logic works.
> > >
> > > This basically argues that anyone in an ancestor user namespace is not
> > > allowed to operate on any of their descendant child namespaces unless
> > > they are in the init_user_ns.
> > >
> > > But that's nonsense as I'm trying to tell you. Any process in an
> > > ancestor user namespace that is privileged over the child namespace can
> > > just setns() into it and then pass that bpf_token_capable() check by
> > > supplying the token.
> > >
> > > At this point just do it properly and allow callers that are privileged
> > > in the token user namespace to load bpf programs. It also means you get
> > > user namespace nesting done properly.
> >
> > Ok, I see. This `current_user_ns() == token->userns` check prevents
> > this part of cap_capable() to ever be exercised:
> >
> >  if ((ns->parent == cred->user_ns) && uid_eq(ns->owner, cred->euid))
> >     return 0;
> >
> > Got it. I'm all for not adding any unnecessary restrictions.
> >
> > >
> > > >
> > > > >
> > > > > For example, if the caller is in the init_user_ns and permissions
> > > > > for CAP_WHATEVER is checked for in token->user_ns and the caller has
> > > > > CAP_WHATEVER in init_user_ns then they also have it in all
> > > > > descendant user namespaces.
> > > >
> > > > Right, so if they didn't use a token they would still pass
> > > > capable(CAP_WHATEVER), right?
> > >
> > > Yes, I'm trying to accomodate your request but making it work
> > > consistently.
> > >
> > > >
> > > > >
> > > > > The original intention had been to align with what we require during
> > > > > token creation meaning that once a token has been created interacting
> > > > > with this token is specifically confined to caller's located in the
> > > > > token's user namespace.
> > > > >
> > > > > If that's not the case then it doesn't make sense to not allow
> > > > > permission checking based on regular capability semantics. IOW, why
> > > > > special case init_user_ns if you're breaking the confinement restriction
> > > > > anyway.
> > > >
> > > > I'm sorry, perhaps I'm dense, but with `current_user_ns() ==
> > > > token->userns` check I think we do fulfill the intention to not allow
> > > > using a token in a userns different from the one in which it was
> > > > created. If that condition isn't satisfied, the token is immediately
> > >
> > > My request originally was about never being able to interact with a
> > > token outside of that userns. This is different as you provide an escape
> > > hatch to init_user_ns. But if you need that and ignore the token then
> > > please do it properly. That's what I'm trying to tell you. See below.
> >
> > Yes, I do need that. Thanks for providing the full code implementation
> > (including LSM), it's much easier this way to converge. Let's see
> > below.
> >
> > >
> > > > ignored. So you can't use a token from another userns for anything,
> > > > it's just not there, effectively.
> > > >
> > > > And as I tried to explain above, I do think that ignoring the token
> > > > instead of erroring out early is what we want to provide good
> > > > user-space ecosystem integration of BPF token.
> > >
> > > There is no erroring out early in. It's:
> > >
> > > (1) Has a token been provided and is the caller capable wrt to the
> > >     namespace of the token? Any caller in an ancestor user namespace
> > >     that has the capability in that user namespace is capable wrt to
> > >     that token. That __includes__ a callers in the init_user_ns. IOW,
> > >     you don't need to fallback to any special checking for init_user_ns.
> > >     It is literally covered in the if (token) branch with the added
> > >     consistency that a process in an ancestor user namespace is
> > >     privileged wrt to that token as well.
> > >
> > > (2) No token has been provided. Then do what we always did and perform
> > >     the capability checks based on the initial user namespace.
> > >
> > > The only thing that you then still need is add that token_capable() hook
> > > in there:
> > >
> > > bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > {
> > >         bool has_cap;
> > >         struct user_namespace *userns = &init_user_ns;
> > >
> > >         if (token)
> > >                 userns = token->userns;
> > >         if (ns_capable(userns, cap))
> >
> > Here, we still need to check security_bpf_token_capable(token, cap)
> > result (and only if token != NULL). And if LSM returns < 0, then drop
> > the token and do the original init userns check.
> >
> > And I just realized that my original implementation has the same
> > problem. In my current implementation if we have a token we will
> > terminate at LSM call, regardless if LSM allows or disallows the
> > token. But that's inconsistent behavior and shouldn't be like that.
> >
> > I will add new tests that validate LSM interactions in the next revision.
> >
> > >                 return true;
> > >         if (cap != CAP_SYS_ADMIN && ns_capable(userns, CAP_SYS_ADMIN))
> > >                 return token ? security_bpf_token_capable(token, cap) == 0 : true;
> >
> > here as well, even if we have a token which passes ns_capable() check,
> > but LSM rejects this token, we still need to forget about the token
> > and do capable() checks in init userns.
> >
> > >         return false;
> > > }
> > >
> > > Or write it however you like. I think this is way more consistent and
> > > gives you a more flexible permission model.
> >
> > Yes, I like it, thanks. Taking into account fixed LSM interactions,
> > here's what I came up with. Yell if you can spot anything wrong (or
> > just hate the style). I did have a version without extra function,
> > just setting the token to NULL and "goto again" approach, but I think
> > it's way less readable and harder to follow. So this is my version
> > right now:
> >
> > static bool bpf_ns_capable(struct user_namespace *ns, int cap)
> > {
> >         return ns_capable(ns, cap) || (cap != CAP_SYS_ADMIN &&
> > ns_capable(ns, CAP_SYS_ADMIN));
> > }
> >
> > static bool token_capable(const struct bpf_token *token, int cap)
> > {
> >         struct user_namespace *userns;
> >
> >         userns = token ? token->userns : &init_user_ns;
> >         if (!bpf_ns_capable(userns, cap))
> >                 return false;
> >         if (token && security_bpf_token_capable(token, cap) < 0)
> >                 return false;
> >         return true;
> > }
> >
> > bool bpf_token_capable(const struct bpf_token *token, int cap)
> > {
> >         /* BPF token allows ns_capable() level of capabilities, but if it
> >          * doesn't grant required capabilities, ignore token and fallback to
> >          * init userns-based checks
> >          */
> >         if (token && token_capable(token, cap))
> >                 return true;
> >         return token_capable(NULL, cap);
> > }
>
> My point is that the capable logic will walk upwards the user namespace
> hierarchy from the token->userns until the user namespace of the caller
> and terminate when it reached the init_user_ns.
>
> A caller is located in some namespace at the point where they call this
> function. They provided a token. The caller isn't capable in the
> namespace of the token so the function falls back to init_user_ns. Two
> interesting cases:
>
> (1) The caller wasn't in an ancestor userns of the token. If that's the
>     case then it follows that the caller also wasn't in the init_user_ns
>     because the init_user_ns is a descendant of all other user
>     namespaces. So falling back will fail.

agreed

>
> (2) The caller was in the same or an ancestor user namespace of the
>     token but didn't have the capability in that user namespace:
>
>      (i) They were in a non-init_user_ns. Therefore they can't be
>          privileged in init_user_ns.
>     (ii) They were in init_user_ns. Therefore, they lacked privileges in
>          the init_user_ns.
>
> In both cases your fallback will do nothing iiuc.

agreed as well

And I agree in general that there isn't a *practically useful* case
where this would matter much. But there is still (at least one) case
where there could be a regression: if token is created in
init_user_ns, caller has CAP_BPF in init_user_ns, caller passes that
token to BPF_PROG_LOAD, and LSM policy rejects that token in
security_bpf_token_capable(). Without the above implementation such
operation will be rejected, even though if there was no token passed
it would succeed. With my implementation above it will succeed as
expected.

Again, I get that those are unlikely corner cases. But this is kernel
API and it should behave consistently. I'd like to avoid having an
asterisk listing exceptional cases where behavior might not be logical
(however unlikely) and stuff like that. The promise is simple:
"providing a token can never regress permissions an application would
have without a token". And libraries like libbpf then can take that as
a contract to work with.

I like this latest implementation above as it is straightforward to
follow and satisfies that contract "by construction":

bool bpf_token_capable(const struct bpf_token *token, int cap)
{
       if (token && token_capable(token, cap))
            return true;
       return token_capable(NULL, cap);
}

So in summary, assuming we are converging and the above
bpf_token_capable() implementation doesn't have any more hidden
issues, I only have s/kvzalloc/kzalloc/, s/kvfree/kfree/ change that
Linus asked for on the kernel side. I'm planning to roll those into
corresponding existing patches. Besides that I'm adding a few new
tests to validate LSM interactions, which I'll add as a separate patch
to the series.

I'm thinking of sending the bpf/bpf-next patch set for v2, just like I
did with v1. But if PR directly to Linus is more preferable, please
someone do let me know, thank you!
Christian Brauner Jan. 12, 2024, 7:16 p.m. UTC | #24
> > My point is that the capable logic will walk upwards the user namespace
> > hierarchy from the token->userns until the user namespace of the caller
> > and terminate when it reached the init_user_ns.
> >
> > A caller is located in some namespace at the point where they call this
> > function. They provided a token. The caller isn't capable in the
> > namespace of the token so the function falls back to init_user_ns. Two
> > interesting cases:
> >
> > (1) The caller wasn't in an ancestor userns of the token. If that's the
> >     case then it follows that the caller also wasn't in the init_user_ns
> >     because the init_user_ns is a descendant of all other user
> >     namespaces. So falling back will fail.
> 
> agreed
> 
> >
> > (2) The caller was in the same or an ancestor user namespace of the
> >     token but didn't have the capability in that user namespace:
> >
> >      (i) They were in a non-init_user_ns. Therefore they can't be
> >          privileged in init_user_ns.
> >     (ii) They were in init_user_ns. Therefore, they lacked privileges in
> >          the init_user_ns.
> >
> > In both cases your fallback will do nothing iiuc.
> 
> agreed as well
> 
> And I agree in general that there isn't a *practically useful* case
> where this would matter much. But there is still (at least one) case
> where there could be a regression: if token is created in
> init_user_ns, caller has CAP_BPF in init_user_ns, caller passes that
> token to BPF_PROG_LOAD, and LSM policy rejects that token in
> security_bpf_token_capable(). Without the above implementation such
> operation will be rejected, even though if there was no token passed
> it would succeed. With my implementation above it will succeed as
> expected.

If that's the case then prevent the creation of tokens in the
init_user_ns and be done with it. If you fallback anyway then this is
the correct solution.

Make this change, please. I'm not willing to support this weird fallback
stuff which is even hard to reason about.
Andrii Nakryiko Jan. 14, 2024, 2:29 a.m. UTC | #25
On Fri, Jan 12, 2024 at 11:17 AM Christian Brauner <brauner@kernel.org> wrote:
>
> > > My point is that the capable logic will walk upwards the user namespace
> > > hierarchy from the token->userns until the user namespace of the caller
> > > and terminate when it reached the init_user_ns.
> > >
> > > A caller is located in some namespace at the point where they call this
> > > function. They provided a token. The caller isn't capable in the
> > > namespace of the token so the function falls back to init_user_ns. Two
> > > interesting cases:
> > >
> > > (1) The caller wasn't in an ancestor userns of the token. If that's the
> > >     case then it follows that the caller also wasn't in the init_user_ns
> > >     because the init_user_ns is a descendant of all other user
> > >     namespaces. So falling back will fail.
> >
> > agreed
> >
> > >
> > > (2) The caller was in the same or an ancestor user namespace of the
> > >     token but didn't have the capability in that user namespace:
> > >
> > >      (i) They were in a non-init_user_ns. Therefore they can't be
> > >          privileged in init_user_ns.
> > >     (ii) They were in init_user_ns. Therefore, they lacked privileges in
> > >          the init_user_ns.
> > >
> > > In both cases your fallback will do nothing iiuc.
> >
> > agreed as well
> >
> > And I agree in general that there isn't a *practically useful* case
> > where this would matter much. But there is still (at least one) case
> > where there could be a regression: if token is created in
> > init_user_ns, caller has CAP_BPF in init_user_ns, caller passes that
> > token to BPF_PROG_LOAD, and LSM policy rejects that token in
> > security_bpf_token_capable(). Without the above implementation such
> > operation will be rejected, even though if there was no token passed
> > it would succeed. With my implementation above it will succeed as
> > expected.
>
> If that's the case then prevent the creation of tokens in the
> init_user_ns and be done with it. If you fallback anyway then this is
> the correct solution.
>
> Make this change, please. I'm not willing to support this weird fallback
> stuff which is even hard to reason about.

Alright, added an extra check. Ok, so in summary I have the changes
below compared to v1 (plus a few extra LSM-related test cases added):

diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
index a86fccd57e2d..7d04378560fd 100644
--- a/kernel/bpf/token.c
+++ b/kernel/bpf/token.c
@@ -9,18 +9,22 @@
 #include <linux/user_namespace.h>
 #include <linux/security.h>

+static bool bpf_ns_capable(struct user_namespace *ns, int cap)
+{
+       return ns_capable(ns, cap) || (cap != CAP_SYS_ADMIN &&
ns_capable(ns, CAP_SYS_ADMIN));
+}
+
 bool bpf_token_capable(const struct bpf_token *token, int cap)
 {
-       /* BPF token allows ns_capable() level of capabilities, but only if
-        * token's userns is *exactly* the same as current user's userns
-        */
-       if (token && current_user_ns() == token->userns) {
-               if (ns_capable(token->userns, cap) ||
-                   (cap != CAP_SYS_ADMIN && ns_capable(token->userns,
CAP_SYS_ADMIN)))
-                       return security_bpf_token_capable(token, cap) == 0;
-       }
-       /* otherwise fallback to capable() checks */
-       return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
+       struct user_namespace *userns;
+
+       /* BPF token allows ns_capable() level of capabilities */
+       userns = token ? token->userns : &init_user_ns;
+       if (!bpf_ns_capable(userns, cap))
+               return false;
+       if (token && security_bpf_token_capable(token, cap) < 0)
+               return false;
+       return true;
 }

 void bpf_token_inc(struct bpf_token *token)
@@ -32,7 +36,7 @@ static void bpf_token_free(struct bpf_token *token)
 {
        security_bpf_token_free(token);
        put_user_ns(token->userns);
-       kvfree(token);
+       kfree(token);
 }

 static void bpf_token_put_deferred(struct work_struct *work)
@@ -152,6 +156,12 @@ int bpf_token_create(union bpf_attr *attr)
                goto out_path;
        }

+       /* Creating BPF token in init_user_ns doesn't make much sense. */
+       if (current_user_ns() == &init_user_ns) {
+               err = -EOPNOTSUPP;
+               goto out_path;
+       }
+
        mnt_opts = path.dentry->d_sb->s_fs_info;
        if (mnt_opts->delegate_cmds == 0 &&
            mnt_opts->delegate_maps == 0 &&
@@ -179,7 +189,7 @@ int bpf_token_create(union bpf_attr *attr)
                goto out_path;
        }

-       token = kvzalloc(sizeof(*token), GFP_USER);
+       token = kzalloc(sizeof(*token), GFP_USER);
        if (!token) {
                err = -ENOMEM;
                goto out_file;
Christian Brauner Jan. 16, 2024, 4:37 p.m. UTC | #26
On Sat, Jan 13, 2024 at 06:29:33PM -0800, Andrii Nakryiko wrote:
> On Fri, Jan 12, 2024 at 11:17 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > > > My point is that the capable logic will walk upwards the user namespace
> > > > hierarchy from the token->userns until the user namespace of the caller
> > > > and terminate when it reached the init_user_ns.
> > > >
> > > > A caller is located in some namespace at the point where they call this
> > > > function. They provided a token. The caller isn't capable in the
> > > > namespace of the token so the function falls back to init_user_ns. Two
> > > > interesting cases:
> > > >
> > > > (1) The caller wasn't in an ancestor userns of the token. If that's the
> > > >     case then it follows that the caller also wasn't in the init_user_ns
> > > >     because the init_user_ns is a descendant of all other user
> > > >     namespaces. So falling back will fail.
> > >
> > > agreed
> > >
> > > >
> > > > (2) The caller was in the same or an ancestor user namespace of the
> > > >     token but didn't have the capability in that user namespace:
> > > >
> > > >      (i) They were in a non-init_user_ns. Therefore they can't be
> > > >          privileged in init_user_ns.
> > > >     (ii) They were in init_user_ns. Therefore, they lacked privileges in
> > > >          the init_user_ns.
> > > >
> > > > In both cases your fallback will do nothing iiuc.
> > >
> > > agreed as well
> > >
> > > And I agree in general that there isn't a *practically useful* case
> > > where this would matter much. But there is still (at least one) case
> > > where there could be a regression: if token is created in
> > > init_user_ns, caller has CAP_BPF in init_user_ns, caller passes that
> > > token to BPF_PROG_LOAD, and LSM policy rejects that token in
> > > security_bpf_token_capable(). Without the above implementation such
> > > operation will be rejected, even though if there was no token passed
> > > it would succeed. With my implementation above it will succeed as
> > > expected.
> >
> > If that's the case then prevent the creation of tokens in the
> > init_user_ns and be done with it. If you fallback anyway then this is
> > the correct solution.
> >
> > Make this change, please. I'm not willing to support this weird fallback
> > stuff which is even hard to reason about.
> 
> Alright, added an extra check. Ok, so in summary I have the changes
> below compared to v1 (plus a few extra LSM-related test cases added):
> 
> diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
> index a86fccd57e2d..7d04378560fd 100644
> --- a/kernel/bpf/token.c
> +++ b/kernel/bpf/token.c
> @@ -9,18 +9,22 @@
>  #include <linux/user_namespace.h>
>  #include <linux/security.h>
> 
> +static bool bpf_ns_capable(struct user_namespace *ns, int cap)
> +{
> +       return ns_capable(ns, cap) || (cap != CAP_SYS_ADMIN &&
> ns_capable(ns, CAP_SYS_ADMIN));
> +}
> +
>  bool bpf_token_capable(const struct bpf_token *token, int cap)
>  {
> -       /* BPF token allows ns_capable() level of capabilities, but only if
> -        * token's userns is *exactly* the same as current user's userns
> -        */
> -       if (token && current_user_ns() == token->userns) {
> -               if (ns_capable(token->userns, cap) ||
> -                   (cap != CAP_SYS_ADMIN && ns_capable(token->userns,
> CAP_SYS_ADMIN)))
> -                       return security_bpf_token_capable(token, cap) == 0;
> -       }
> -       /* otherwise fallback to capable() checks */
> -       return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
> +       struct user_namespace *userns;
> +
> +       /* BPF token allows ns_capable() level of capabilities */
> +       userns = token ? token->userns : &init_user_ns;
> +       if (!bpf_ns_capable(userns, cap))
> +               return false;
> +       if (token && security_bpf_token_capable(token, cap) < 0)
> +               return false;
> +       return true;
>  }
> 
>  void bpf_token_inc(struct bpf_token *token)
> @@ -32,7 +36,7 @@ static void bpf_token_free(struct bpf_token *token)
>  {
>         security_bpf_token_free(token);
>         put_user_ns(token->userns);
> -       kvfree(token);
> +       kfree(token);
>  }
> 
>  static void bpf_token_put_deferred(struct work_struct *work)
> @@ -152,6 +156,12 @@ int bpf_token_create(union bpf_attr *attr)
>                 goto out_path;
>         }
> 
> +       /* Creating BPF token in init_user_ns doesn't make much sense. */
> +       if (current_user_ns() == &init_user_ns) {
> +               err = -EOPNOTSUPP;
> +               goto out_path;
> +       }
> +
>         mnt_opts = path.dentry->d_sb->s_fs_info;
>         if (mnt_opts->delegate_cmds == 0 &&
>             mnt_opts->delegate_maps == 0 &&
> @@ -179,7 +189,7 @@ int bpf_token_create(union bpf_attr *attr)
>                 goto out_path;
>         }
> 
> -       token = kvzalloc(sizeof(*token), GFP_USER);
> +       token = kzalloc(sizeof(*token), GFP_USER);
>         if (!token) {
>                 err = -ENOMEM;
>                 goto out_file;

Thank you! Looks good,

Acked-by: Christian Brauner <brauner@kernel.org>
diff mbox series

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d3658bd959f2..e87fe928645f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -52,6 +52,10 @@  struct module;
 struct bpf_func_state;
 struct ftrace_ops;
 struct cgroup;
+struct bpf_token;
+struct user_namespace;
+struct super_block;
+struct inode;
 
 extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
@@ -1620,6 +1624,13 @@  struct bpf_mount_opts {
 	u64 delegate_attachs;
 };
 
+struct bpf_token {
+	struct work_struct work;
+	atomic64_t refcnt;
+	struct user_namespace *userns;
+	u64 allowed_cmds;
+};
+
 struct bpf_struct_ops_value;
 struct btf_member;
 
@@ -2079,6 +2090,7 @@  static inline void bpf_enable_instrumentation(void)
 	migrate_enable();
 }
 
+extern const struct super_operations bpf_super_ops;
 extern const struct file_operations bpf_map_fops;
 extern const struct file_operations bpf_prog_fops;
 extern const struct file_operations bpf_iter_fops;
@@ -2213,6 +2225,8 @@  static inline void bpf_map_dec_elem_count(struct bpf_map *map)
 
 extern int sysctl_unprivileged_bpf_disabled;
 
+bool bpf_token_capable(const struct bpf_token *token, int cap);
+
 static inline bool bpf_allow_ptr_leaks(void)
 {
 	return perfmon_capable();
@@ -2247,8 +2261,17 @@  int bpf_link_new_fd(struct bpf_link *link);
 struct bpf_link *bpf_link_get_from_fd(u32 ufd);
 struct bpf_link *bpf_link_get_curr_or_next(u32 *id);
 
+void bpf_token_inc(struct bpf_token *token);
+void bpf_token_put(struct bpf_token *token);
+int bpf_token_create(union bpf_attr *attr);
+struct bpf_token *bpf_token_get_from_fd(u32 ufd);
+
+bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd);
+
 int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
 int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
+struct inode *bpf_get_inode(struct super_block *sb, const struct inode *dir,
+			    umode_t mode);
 
 #define BPF_ITER_FUNC_PREFIX "bpf_iter_"
 #define DEFINE_BPF_ITER_FUNC(target, args...)			\
@@ -2606,6 +2629,24 @@  static inline int bpf_obj_get_user(const char __user *pathname, int flags)
 	return -EOPNOTSUPP;
 }
 
+static inline bool bpf_token_capable(const struct bpf_token *token, int cap)
+{
+	return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
+}
+
+static inline void bpf_token_inc(struct bpf_token *token)
+{
+}
+
+static inline void bpf_token_put(struct bpf_token *token)
+{
+}
+
+static inline struct bpf_token *bpf_token_get_from_fd(u32 ufd)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
 static inline void __dev_flush(void)
 {
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 754e68ca8744..ab48e037d543 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -847,6 +847,36 @@  union bpf_iter_link_info {
  *		Returns zero on success. On error, -1 is returned and *errno*
  *		is set appropriately.
  *
+ * BPF_TOKEN_CREATE
+ *	Description
+ *		Create BPF token with embedded information about what
+ *		BPF-related functionality it allows:
+ *		- a set of allowed bpf() syscall commands;
+ *		- a set of allowed BPF map types to be created with
+ *		BPF_MAP_CREATE command, if BPF_MAP_CREATE itself is allowed;
+ *		- a set of allowed BPF program types and BPF program attach
+ *		types to be loaded with BPF_PROG_LOAD command, if
+ *		BPF_PROG_LOAD itself is allowed.
+ *
+ *		BPF token is created (derived) from an instance of BPF FS,
+ *		assuming it has necessary delegation mount options specified.
+ *		This BPF token can be passed as an extra parameter to various
+ *		bpf() syscall commands to grant BPF subsystem functionality to
+ *		unprivileged processes.
+ *
+ *		When created, BPF token is "associated" with the owning
+ *		user namespace of BPF FS instance (super block) that it was
+ *		derived from, and subsequent BPF operations performed with
+ *		BPF token would be performing capabilities checks (i.e.,
+ *		CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN) within
+ *		that user namespace. Without BPF token, such capabilities
+ *		have to be granted in init user namespace, making bpf()
+ *		syscall incompatible with user namespace, for the most part.
+ *
+ *	Return
+ *		A new file descriptor (a nonnegative integer), or -1 if an
+ *		error occurred (in which case, *errno* is set appropriately).
+ *
  * NOTES
  *	eBPF objects (maps and programs) can be shared between processes.
  *
@@ -901,6 +931,8 @@  enum bpf_cmd {
 	BPF_ITER_CREATE,
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
+	BPF_TOKEN_CREATE,
+	__MAX_BPF_CMD,
 };
 
 enum bpf_map_type {
@@ -1714,6 +1746,11 @@  union bpf_attr {
 		__u32		flags;		/* extra flags */
 	} prog_bind_map;
 
+	struct { /* struct used by BPF_TOKEN_CREATE command */
+		__u32		flags;
+		__u32		bpffs_fd;
+	} token_create;
+
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index f526b7573e97..4ce95acfcaa7 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -6,7 +6,7 @@  cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse
 endif
 CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o token.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 70b748f6228c..565be1f3f1ea 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -99,9 +99,9 @@  static const struct inode_operations bpf_prog_iops = { };
 static const struct inode_operations bpf_map_iops  = { };
 static const struct inode_operations bpf_link_iops  = { };
 
-static struct inode *bpf_get_inode(struct super_block *sb,
-				   const struct inode *dir,
-				   umode_t mode)
+struct inode *bpf_get_inode(struct super_block *sb,
+			    const struct inode *dir,
+			    umode_t mode)
 {
 	struct inode *inode;
 
@@ -603,6 +603,7 @@  static int bpf_show_options(struct seq_file *m, struct dentry *root)
 	struct inode *inode = d_inode(root);
 	umode_t mode = inode->i_mode & S_IALLUGO & ~S_ISVTX;
 	struct bpf_mount_opts *opts = root->d_sb->s_fs_info;
+	u64 mask;
 
 	if (!uid_eq(inode->i_uid, GLOBAL_ROOT_UID))
 		seq_printf(m, ",uid=%u",
@@ -613,7 +614,8 @@  static int bpf_show_options(struct seq_file *m, struct dentry *root)
 	if (mode != S_IRWXUGO)
 		seq_printf(m, ",mode=%o", mode);
 
-	if (opts->delegate_cmds == ~0ULL)
+	mask = (1ULL << __MAX_BPF_CMD) - 1;
+	if ((opts->delegate_cmds & mask) == mask)
 		seq_printf(m, ",delegate_cmds=any");
 	else if (opts->delegate_cmds)
 		seq_printf(m, ",delegate_cmds=0x%llx", opts->delegate_cmds);
@@ -646,7 +648,7 @@  static void bpf_free_inode(struct inode *inode)
 	free_inode_nonrcu(inode);
 }
 
-static const struct super_operations bpf_super_ops = {
+const struct super_operations bpf_super_ops = {
 	.statfs		= simple_statfs,
 	.drop_inode	= generic_delete_inode,
 	.show_options	= bpf_show_options,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 2392e1802364..423702f33ba6 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -5390,6 +5390,20 @@  static int bpf_prog_bind_map(union bpf_attr *attr)
 	return ret;
 }
 
+#define BPF_TOKEN_CREATE_LAST_FIELD token_create.bpffs_fd
+
+static int token_create(union bpf_attr *attr)
+{
+	if (CHECK_ATTR(BPF_TOKEN_CREATE))
+		return -EINVAL;
+
+	/* no flags are supported yet */
+	if (attr->token_create.flags)
+		return -EINVAL;
+
+	return bpf_token_create(attr);
+}
+
 static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
 {
 	union bpf_attr attr;
@@ -5523,6 +5537,9 @@  static int __sys_bpf(int cmd, bpfptr_t uattr, unsigned int size)
 	case BPF_PROG_BIND_MAP:
 		err = bpf_prog_bind_map(&attr);
 		break;
+	case BPF_TOKEN_CREATE:
+		err = token_create(&attr);
+		break;
 	default:
 		err = -EINVAL;
 		break;
diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
new file mode 100644
index 000000000000..e18aaecc67e9
--- /dev/null
+++ b/kernel/bpf/token.c
@@ -0,0 +1,214 @@ 
+#include <linux/bpf.h>
+#include <linux/vmalloc.h>
+#include <linux/fdtable.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/kernel.h>
+#include <linux/idr.h>
+#include <linux/namei.h>
+#include <linux/user_namespace.h>
+
+bool bpf_token_capable(const struct bpf_token *token, int cap)
+{
+	/* BPF token allows ns_capable() level of capabilities, but only if
+	 * token's userns is *exactly* the same as current user's userns
+	 */
+	if (token && current_user_ns() == token->userns) {
+		if (ns_capable(token->userns, cap))
+			return true;
+		if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
+			return true;
+	}
+	/* otherwise fallback to capable() checks */
+	return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
+}
+
+void bpf_token_inc(struct bpf_token *token)
+{
+	atomic64_inc(&token->refcnt);
+}
+
+static void bpf_token_free(struct bpf_token *token)
+{
+	put_user_ns(token->userns);
+	kvfree(token);
+}
+
+static void bpf_token_put_deferred(struct work_struct *work)
+{
+	struct bpf_token *token = container_of(work, struct bpf_token, work);
+
+	bpf_token_free(token);
+}
+
+void bpf_token_put(struct bpf_token *token)
+{
+	if (!token)
+		return;
+
+	if (!atomic64_dec_and_test(&token->refcnt))
+		return;
+
+	INIT_WORK(&token->work, bpf_token_put_deferred);
+	schedule_work(&token->work);
+}
+
+static int bpf_token_release(struct inode *inode, struct file *filp)
+{
+	struct bpf_token *token = filp->private_data;
+
+	bpf_token_put(token);
+	return 0;
+}
+
+static void bpf_token_show_fdinfo(struct seq_file *m, struct file *filp)
+{
+	struct bpf_token *token = filp->private_data;
+	u64 mask;
+
+	BUILD_BUG_ON(__MAX_BPF_CMD >= 64);
+	mask = (1ULL << __MAX_BPF_CMD) - 1;
+	if ((token->allowed_cmds & mask) == mask)
+		seq_printf(m, "allowed_cmds:\tany\n");
+	else
+		seq_printf(m, "allowed_cmds:\t0x%llx\n", token->allowed_cmds);
+}
+
+#define BPF_TOKEN_INODE_NAME "bpf-token"
+
+static const struct inode_operations bpf_token_iops = { };
+
+static const struct file_operations bpf_token_fops = {
+	.release	= bpf_token_release,
+	.show_fdinfo	= bpf_token_show_fdinfo,
+};
+
+int bpf_token_create(union bpf_attr *attr)
+{
+	struct bpf_mount_opts *mnt_opts;
+	struct bpf_token *token = NULL;
+	struct user_namespace *userns;
+	struct inode *inode;
+	struct file *file;
+	struct path path;
+	struct fd f;
+	umode_t mode;
+	int err, fd;
+
+	f = fdget(attr->token_create.bpffs_fd);
+	if (!f.file)
+		return -EBADF;
+
+	path = f.file->f_path;
+	path_get(&path);
+	fdput(f);
+
+	if (path.dentry != path.mnt->mnt_sb->s_root) {
+		err = -EINVAL;
+		goto out_path;
+	}
+	if (path.mnt->mnt_sb->s_op != &bpf_super_ops) {
+		err = -EINVAL;
+		goto out_path;
+	}
+	err = path_permission(&path, MAY_ACCESS);
+	if (err)
+		goto out_path;
+
+	userns = path.dentry->d_sb->s_user_ns;
+	/*
+	 * Enforce that creators of BPF tokens are in the same user
+	 * namespace as the BPF FS instance. This makes reasoning about
+	 * permissions a lot easier and we can always relax this later.
+	 */
+	if (current_user_ns() != userns) {
+		err = -EPERM;
+		goto out_path;
+	}
+	if (!ns_capable(userns, CAP_BPF)) {
+		err = -EPERM;
+		goto out_path;
+	}
+
+	mode = S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask());
+	inode = bpf_get_inode(path.mnt->mnt_sb, NULL, mode);
+	if (IS_ERR(inode)) {
+		err = PTR_ERR(inode);
+		goto out_path;
+	}
+
+	inode->i_op = &bpf_token_iops;
+	inode->i_fop = &bpf_token_fops;
+	clear_nlink(inode); /* make sure it is unlinked */
+
+	file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
+	if (IS_ERR(file)) {
+		iput(inode);
+		err = PTR_ERR(file);
+		goto out_path;
+	}
+
+	token = kvzalloc(sizeof(*token), GFP_USER);
+	if (!token) {
+		err = -ENOMEM;
+		goto out_file;
+	}
+
+	atomic64_set(&token->refcnt, 1);
+
+	/* remember bpffs owning userns for future ns_capable() checks */
+	token->userns = get_user_ns(userns);
+
+	mnt_opts = path.dentry->d_sb->s_fs_info;
+	token->allowed_cmds = mnt_opts->delegate_cmds;
+
+	fd = get_unused_fd_flags(O_CLOEXEC);
+	if (fd < 0) {
+		err = fd;
+		goto out_token;
+	}
+
+	file->private_data = token;
+	fd_install(fd, file);
+
+	path_put(&path);
+	return fd;
+
+out_token:
+	bpf_token_free(token);
+out_file:
+	fput(file);
+out_path:
+	path_put(&path);
+	return err;
+}
+
+struct bpf_token *bpf_token_get_from_fd(u32 ufd)
+{
+	struct fd f = fdget(ufd);
+	struct bpf_token *token;
+
+	if (!f.file)
+		return ERR_PTR(-EBADF);
+	if (f.file->f_op != &bpf_token_fops) {
+		fdput(f);
+		return ERR_PTR(-EINVAL);
+	}
+
+	token = f.file->private_data;
+	bpf_token_inc(token);
+	fdput(f);
+
+	return token;
+}
+
+bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
+{
+	/* BPF token can be used only within exactly the same userns in which
+	 * it was created
+	 */
+	if (!token || current_user_ns() != token->userns)
+		return false;
+
+	return token->allowed_cmds & (1ULL << cmd);
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7f24d898efbb..57487d0c0b73 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -847,6 +847,36 @@  union bpf_iter_link_info {
  *		Returns zero on success. On error, -1 is returned and *errno*
  *		is set appropriately.
  *
+ * BPF_TOKEN_CREATE
+ *	Description
+ *		Create BPF token with embedded information about what
+ *		BPF-related functionality it allows:
+ *		- a set of allowed bpf() syscall commands;
+ *		- a set of allowed BPF map types to be created with
+ *		BPF_MAP_CREATE command, if BPF_MAP_CREATE itself is allowed;
+ *		- a set of allowed BPF program types and BPF program attach
+ *		types to be loaded with BPF_PROG_LOAD command, if
+ *		BPF_PROG_LOAD itself is allowed.
+ *
+ *		BPF token is created (derived) from an instance of BPF FS,
+ *		assuming it has necessary delegation mount options specified.
+ *		This BPF token can be passed as an extra parameter to various
+ *		bpf() syscall commands to grant BPF subsystem functionality to
+ *		unprivileged processes.
+ *
+ *		When created, BPF token is "associated" with the owning
+ *		user namespace of BPF FS instance (super block) that it was
+ *		derived from, and subsequent BPF operations performed with
+ *		BPF token would be performing capabilities checks (i.e.,
+ *		CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN) within
+ *		that user namespace. Without BPF token, such capabilities
+ *		have to be granted in init user namespace, making bpf()
+ *		syscall incompatible with user namespace, for the most part.
+ *
+ *	Return
+ *		A new file descriptor (a nonnegative integer), or -1 if an
+ *		error occurred (in which case, *errno* is set appropriately).
+ *
  * NOTES
  *	eBPF objects (maps and programs) can be shared between processes.
  *
@@ -901,6 +931,8 @@  enum bpf_cmd {
 	BPF_ITER_CREATE,
 	BPF_LINK_DETACH,
 	BPF_PROG_BIND_MAP,
+	BPF_TOKEN_CREATE,
+	__MAX_BPF_CMD,
 };
 
 enum bpf_map_type {
@@ -1714,6 +1746,11 @@  union bpf_attr {
 		__u32		flags;		/* extra flags */
 	} prog_bind_map;
 
+	struct { /* struct used by BPF_TOKEN_CREATE command */
+		__u32		flags;
+		__u32		bpffs_fd;
+	} token_create;
+
 } __attribute__((aligned(8)));
 
 /* The description below is an attempt at providing documentation to eBPF