[RFC,bpf-next,00/13] bpf: Introduce BPF namespace

Message ID	20230326092208.13613-1-laoar.shao@gmail.com (mailing list archive)
Headers	show Return-Path: <bpf-owner@vger.kernel.org> From: Yafang Shao <laoar.shao@gmail.com> To: ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, kafai@fb.com, songliubraving@fb.com, yhs@fb.com, john.fastabend@gmail.com, kpsingh@kernel.org, sdf@google.com, haoluo@google.com, jolsa@kernel.org Cc: bpf@vger.kernel.org, linux-kernel@vger.kernel.org, Yafang Shao <laoar.shao@gmail.com> Subject: [RFC PATCH bpf-next 00/13] bpf: Introduce BPF namespace Date: Sun, 26 Mar 2023 09:21:55 +0000 Message-Id: <20230326092208.13613-1-laoar.shao@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	bpf: Introduce BPF namespace \| expand [RFC,bpf-next,00/13] bpf: Introduce BPF namespace [RFC,bpf-next,01/13] fork: New clone3 flag for BPF namespace [RFC,bpf-next,02/13] proc_ns: Extend the field type in struct proc_ns_operations to long [RFC,bpf-next,03/13] bpf: Implement bpf namespace [RFC,bpf-next,04/13] bpf: No need to check if id is 0 [RFC,bpf-next,05/13] bpf: Make bpf objects id have the same alloc and free pattern [RFC,bpf-next,06/13] bpf: Helpers to alloc and free object id in bpf namespace [RFC,bpf-next,07/13] bpf: Add bpf helper to get bpf object id [RFC,bpf-next,08/13] bpf: Alloc and free bpf_map id in bpf namespace [RFC,bpf-next,09/13] bpf: Alloc and free bpf_prog id in bpf namespace [RFC,bpf-next,10/13] bpf: Alloc and free bpf_link id in bpf namespace [RFC,bpf-next,11/13] bpf: Allow iterating bpf objects with CAP_BPF in bpf namespace [RFC,bpf-next,12/13] bpf: Use bpf_idr_lock array instead [RFC,bpf-next,13/13] selftests/bpf: Add selftest for bpf namespace

Yafang Shao March 26, 2023, 9:21 a.m. UTC

Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
to FDs, that's intended for BPF's security model[1]. Not only does it
prevent non-privilidged users from getting other users' bpf program, but
also it prevents the user from iterating his own bpf objects.

In container environment, some users want to run bpf programs in their
containers. These users can run their bpf programs under CAP_BPF and
some other specific CAPs, but they can't inspect their bpf programs in a
generic way. For example, the bpftool can't be used as it requires
CAP_SYS_ADMIN. That is very inconvenient.

Without CAP_SYS_ADMIN, the only way to get the information of a bpf object
which is not created by the process itself is with SCM_RIGHTS, that
requires each processes which created bpf object has to implement a unix
domain socket to share the fd of a bpf object between different
processes, that is really trivial and troublesome.

Hence we need a better mechanism to get bpf object info without
CAP_SYS_ADMIN. 

BPF namespace is introduced in this patchset with an attempt to remove 
the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
link in a specific bpf namespace, then these bpf objects will not be
visible to the users in a different bpf namespace. But these bpf
objects are visible to its parent bpf namespace, so the sys admin can 
still iterate and inspect them.

BPF namespace is similar to PID namespace, and the bpf objects are
similar to tasks, so BPF namespace is very easy to understand. These
patchset only implements BPF namespace for bpf map, prog and link. In the
future we may extend it to other bpf objects like btf, bpffs and etc.
For example, we can allow some of the BTF objects to be used in
non-init bpf namespace, then the container user can only trace the
processes running in his container, but can't get the information of
tasks running in other containers.

A simple example is introduced into selftests/bpf on how to use the bpf
namespace.

Putting bpf map, prog and link into bpf namespace is the first step.
Let's start with it.

[1]. https://lore.kernel.org/bpf/20200513230355.7858-1-alexei.starovoitov@gmail.com/

Yafang Shao (13):
  fork: New clone3 flag for BPF namespace
  proc_ns: Extend the field type in struct proc_ns_operations to long
  bpf: Implement bpf namespace
  bpf: No need to check if id is 0
  bpf: Make bpf objects id have the same alloc and free pattern
  bpf: Helpers to alloc and free object id in bpf namespace
  bpf: Add bpf helper to get bpf object id
  bpf: Alloc and free bpf_map id in bpf namespace
  bpf: Alloc and free bpf_prog id in bpf namespace
  bpf: Alloc and free bpf_link id in bpf namespace
  bpf: Allow iterating bpf objects with CAP_BPF in bpf namespace
  bpf: Use bpf_idr_lock array instead
  selftests/bpf: Add selftest for bpf namespace

 fs/proc/namespaces.c                      |   4 +
 include/linux/bpf.h                       |   9 +-
 include/linux/bpf_namespace.h             |  88 ++++++++++
 include/linux/nsproxy.h                   |   4 +
 include/linux/proc_ns.h                   |   3 +-
 include/linux/user_namespace.h            |   1 +
 include/uapi/linux/bpf.h                  |   7 +
 include/uapi/linux/sched.h                |   1 +
 kernel/bpf/Makefile                       |   1 +
 kernel/bpf/bpf_namespace.c                | 283 ++++++++++++++++++++++++++++++
 kernel/bpf/offload.c                      |  16 +-
 kernel/bpf/syscall.c                      | 262 ++++++++++-----------------
 kernel/bpf/task_iter.c                    |  12 ++
 kernel/fork.c                             |   5 +-
 kernel/nsproxy.c                          |  19 +-
 kernel/trace/bpf_trace.c                  |   2 +
 kernel/ucount.c                           |   1 +
 tools/bpf/bpftool/skeleton/pid_iter.bpf.c |  13 +-
 tools/include/uapi/linux/bpf.h            |   7 +
 tools/testing/selftests/bpf/Makefile      |   3 +-
 tools/testing/selftests/bpf/test_bpfns.c  |  76 ++++++++
 21 files changed, 637 insertions(+), 180 deletions(-)
 create mode 100644 include/linux/bpf_namespace.h
 create mode 100644 kernel/bpf/bpf_namespace.c
 create mode 100644 tools/testing/selftests/bpf/test_bpfns.c

Toke Høiland-Jørgensen March 26, 2023, 10:49 a.m. UTC | #1

Yafang Shao <laoar.shao@gmail.com> writes:

> Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> to FDs, that's intended for BPF's security model[1]. Not only does it
> prevent non-privilidged users from getting other users' bpf program, but
> also it prevents the user from iterating his own bpf objects.
>
> In container environment, some users want to run bpf programs in their
> containers. These users can run their bpf programs under CAP_BPF and
> some other specific CAPs, but they can't inspect their bpf programs in a
> generic way. For example, the bpftool can't be used as it requires
> CAP_SYS_ADMIN. That is very inconvenient.
>
> Without CAP_SYS_ADMIN, the only way to get the information of a bpf object
> which is not created by the process itself is with SCM_RIGHTS, that
> requires each processes which created bpf object has to implement a unix
> domain socket to share the fd of a bpf object between different
> processes, that is really trivial and troublesome.
>
> Hence we need a better mechanism to get bpf object info without
> CAP_SYS_ADMIN. 
>
> BPF namespace is introduced in this patchset with an attempt to remove 
> the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> link in a specific bpf namespace, then these bpf objects will not be
> visible to the users in a different bpf namespace. But these bpf
> objects are visible to its parent bpf namespace, so the sys admin can 
> still iterate and inspect them.
>
> BPF namespace is similar to PID namespace, and the bpf objects are
> similar to tasks, so BPF namespace is very easy to understand. These
> patchset only implements BPF namespace for bpf map, prog and link. In the
> future we may extend it to other bpf objects like btf, bpffs and etc.

May? I think we should cover all of the existing BPF objects from the
beginning here, or we may miss important interactions that will
invalidate the whole idea. In particular, I'm a little worried about the
interaction between namespaces and bpffs; what happens if you're in a
bpf namespace and you try to read a BPF object from a bpffs that belongs
to a different namespace? Does the operation fail? Is the object hidden
entirely? Something else?

-Toke

Yafang Shao March 27, 2023, 3:07 a.m. UTC | #2

On Sun, Mar 26, 2023 at 6:49 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > to FDs, that's intended for BPF's security model[1]. Not only does it
> > prevent non-privilidged users from getting other users' bpf program, but
> > also it prevents the user from iterating his own bpf objects.
> >
> > In container environment, some users want to run bpf programs in their
> > containers. These users can run their bpf programs under CAP_BPF and
> > some other specific CAPs, but they can't inspect their bpf programs in a
> > generic way. For example, the bpftool can't be used as it requires
> > CAP_SYS_ADMIN. That is very inconvenient.
> >
> > Without CAP_SYS_ADMIN, the only way to get the information of a bpf object
> > which is not created by the process itself is with SCM_RIGHTS, that
> > requires each processes which created bpf object has to implement a unix
> > domain socket to share the fd of a bpf object between different
> > processes, that is really trivial and troublesome.
> >
> > Hence we need a better mechanism to get bpf object info without
> > CAP_SYS_ADMIN.
> >
> > BPF namespace is introduced in this patchset with an attempt to remove
> > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> > link in a specific bpf namespace, then these bpf objects will not be
> > visible to the users in a different bpf namespace. But these bpf
> > objects are visible to its parent bpf namespace, so the sys admin can
> > still iterate and inspect them.
> >
> > BPF namespace is similar to PID namespace, and the bpf objects are
> > similar to tasks, so BPF namespace is very easy to understand. These
> > patchset only implements BPF namespace for bpf map, prog and link. In the
> > future we may extend it to other bpf objects like btf, bpffs and etc.
>
> May? I think we should cover all of the existing BPF objects from the
> beginning here, or we may miss important interactions that will
> invalidate the whole idea.

This patchset is intended to address iterating bpf IDs and converting
IDs to FDs.  To be more specific, it covers
BPF_{PROG,MAP,LINK}_GET_NEXT_ID and BPF_{PROG,MAP,LINK}_GET_FD_BY_ID.
It should also include BPF_BTF_GET_NEXT_ID and BPF_BTF_GET_FD_BY_ID,
but I don't implement it because I find we can do more wrt BTF, for
example, if we can expose a small amount of BTFs in the vmlinux to
non-root bpf namespace.
But, yes, I should implement BTF ID in this patchset.

> In particular, I'm a little worried about the
> interaction between namespaces and bpffs; what happens if you're in a
> bpf namespace and you try to read a BPF object from a bpffs that belongs
> to a different namespace? Does the operation fail? Is the object hidden
> entirely? Something else?
>

bpffs is a different topic and it can be implemented in later patchsets.
bpffs has its own specific problem even without the bpf namespace.
1. The user can always get the information of a bpf object through its
corresponding pinned file.
In our practice, different container users have different bpffs, and
we allow the container user to bind-mount its bpffs only, so others'
bpffs are invisible.
To make it better with the bpf namespace, I think we can fail the
operation if the pinned file doesn't belong to its bpf namespace. That
said, we will add pinned bpf files into the bpf namespace in the next
step.

2. The user can always iterate bpf objects through progs.debug and maps.debug
progs.debug and maps.debug are debugging purposes only. So I think we
can handle it later.

Stanislav Fomichev March 27, 2023, 5:28 p.m. UTC | #3

On 03/26, Yafang Shao wrote:
> Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> to FDs, that's intended for BPF's security model[1]. Not only does it
> prevent non-privilidged users from getting other users' bpf program, but
> also it prevents the user from iterating his own bpf objects.

> In container environment, some users want to run bpf programs in their
> containers. These users can run their bpf programs under CAP_BPF and
> some other specific CAPs, but they can't inspect their bpf programs in a
> generic way. For example, the bpftool can't be used as it requires
> CAP_SYS_ADMIN. That is very inconvenient.

> Without CAP_SYS_ADMIN, the only way to get the information of a bpf object
> which is not created by the process itself is with SCM_RIGHTS, that
> requires each processes which created bpf object has to implement a unix
> domain socket to share the fd of a bpf object between different
> processes, that is really trivial and troublesome.

> Hence we need a better mechanism to get bpf object info without
> CAP_SYS_ADMIN.

[..]

> BPF namespace is introduced in this patchset with an attempt to remove
> the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> link in a specific bpf namespace, then these bpf objects will not be
> visible to the users in a different bpf namespace. But these bpf
> objects are visible to its parent bpf namespace, so the sys admin can
> still iterate and inspect them.

Does it essentially mean unpriv bpf? Can I, as a non-root, create
a new bpf namespace and start loading/attaching progs?
Maybe add a paragraph about now vs whatever you're proposing.
Otherwise it's not very clear what's the security story.
(haven't looked at the whole series, so maybe it's answered somewhere else?)

> BPF namespace is similar to PID namespace, and the bpf objects are
> similar to tasks, so BPF namespace is very easy to understand. These
> patchset only implements BPF namespace for bpf map, prog and link. In the
> future we may extend it to other bpf objects like btf, bpffs and etc.
> For example, we can allow some of the BTF objects to be used in
> non-init bpf namespace, then the container user can only trace the
> processes running in his container, but can't get the information of
> tasks running in other containers.

> A simple example is introduced into selftests/bpf on how to use the bpf
> namespace.

> Putting bpf map, prog and link into bpf namespace is the first step.
> Let's start with it.

> [1].  
> https://lore.kernel.org/bpf/20200513230355.7858-1-alexei.starovoitov@gmail.com/

> Yafang Shao (13):
>    fork: New clone3 flag for BPF namespace
>    proc_ns: Extend the field type in struct proc_ns_operations to long
>    bpf: Implement bpf namespace
>    bpf: No need to check if id is 0
>    bpf: Make bpf objects id have the same alloc and free pattern
>    bpf: Helpers to alloc and free object id in bpf namespace
>    bpf: Add bpf helper to get bpf object id
>    bpf: Alloc and free bpf_map id in bpf namespace
>    bpf: Alloc and free bpf_prog id in bpf namespace
>    bpf: Alloc and free bpf_link id in bpf namespace
>    bpf: Allow iterating bpf objects with CAP_BPF in bpf namespace
>    bpf: Use bpf_idr_lock array instead
>    selftests/bpf: Add selftest for bpf namespace

>   fs/proc/namespaces.c                      |   4 +
>   include/linux/bpf.h                       |   9 +-
>   include/linux/bpf_namespace.h             |  88 ++++++++++
>   include/linux/nsproxy.h                   |   4 +
>   include/linux/proc_ns.h                   |   3 +-
>   include/linux/user_namespace.h            |   1 +
>   include/uapi/linux/bpf.h                  |   7 +
>   include/uapi/linux/sched.h                |   1 +
>   kernel/bpf/Makefile                       |   1 +
>   kernel/bpf/bpf_namespace.c                | 283  
> ++++++++++++++++++++++++++++++
>   kernel/bpf/offload.c                      |  16 +-
>   kernel/bpf/syscall.c                      | 262  
> ++++++++++-----------------
>   kernel/bpf/task_iter.c                    |  12 ++
>   kernel/fork.c                             |   5 +-
>   kernel/nsproxy.c                          |  19 +-
>   kernel/trace/bpf_trace.c                  |   2 +
>   kernel/ucount.c                           |   1 +
>   tools/bpf/bpftool/skeleton/pid_iter.bpf.c |  13 +-
>   tools/include/uapi/linux/bpf.h            |   7 +
>   tools/testing/selftests/bpf/Makefile      |   3 +-
>   tools/testing/selftests/bpf/test_bpfns.c  |  76 ++++++++
>   21 files changed, 637 insertions(+), 180 deletions(-)
>   create mode 100644 include/linux/bpf_namespace.h
>   create mode 100644 kernel/bpf/bpf_namespace.c
>   create mode 100644 tools/testing/selftests/bpf/test_bpfns.c

> --
> 1.8.3.1

Song Liu March 27, 2023, 7:03 p.m. UTC | #4

On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> to FDs, that's intended for BPF's security model[1]. Not only does it
> prevent non-privilidged users from getting other users' bpf program, but
> also it prevents the user from iterating his own bpf objects.
>
> In container environment, some users want to run bpf programs in their
> containers. These users can run their bpf programs under CAP_BPF and
> some other specific CAPs, but they can't inspect their bpf programs in a
> generic way. For example, the bpftool can't be used as it requires
> CAP_SYS_ADMIN. That is very inconvenient.

Agreed that it is important to enable tools like bpftool without
CAP_SYS_ADMIN. However, I am not sure whether we need a new
namespace for this. Can we reuse some existing namespace for this?

If we do need a new namespace, maybe we should share some effort
with tracer namespace proposal [1]?

Thanks,
Song

[1] https://lpc.events/event/16/contributions/1237/

Toke Høiland-Jørgensen March 27, 2023, 8:51 p.m. UTC | #5

Yafang Shao <laoar.shao@gmail.com> writes:

> On Sun, Mar 26, 2023 at 6:49 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
>> > to FDs, that's intended for BPF's security model[1]. Not only does it
>> > prevent non-privilidged users from getting other users' bpf program, but
>> > also it prevents the user from iterating his own bpf objects.
>> >
>> > In container environment, some users want to run bpf programs in their
>> > containers. These users can run their bpf programs under CAP_BPF and
>> > some other specific CAPs, but they can't inspect their bpf programs in a
>> > generic way. For example, the bpftool can't be used as it requires
>> > CAP_SYS_ADMIN. That is very inconvenient.
>> >
>> > Without CAP_SYS_ADMIN, the only way to get the information of a bpf object
>> > which is not created by the process itself is with SCM_RIGHTS, that
>> > requires each processes which created bpf object has to implement a unix
>> > domain socket to share the fd of a bpf object between different
>> > processes, that is really trivial and troublesome.
>> >
>> > Hence we need a better mechanism to get bpf object info without
>> > CAP_SYS_ADMIN.
>> >
>> > BPF namespace is introduced in this patchset with an attempt to remove
>> > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
>> > link in a specific bpf namespace, then these bpf objects will not be
>> > visible to the users in a different bpf namespace. But these bpf
>> > objects are visible to its parent bpf namespace, so the sys admin can
>> > still iterate and inspect them.
>> >
>> > BPF namespace is similar to PID namespace, and the bpf objects are
>> > similar to tasks, so BPF namespace is very easy to understand. These
>> > patchset only implements BPF namespace for bpf map, prog and link. In the
>> > future we may extend it to other bpf objects like btf, bpffs and etc.
>>
>> May? I think we should cover all of the existing BPF objects from the
>> beginning here, or we may miss important interactions that will
>> invalidate the whole idea.
>
> This patchset is intended to address iterating bpf IDs and converting
> IDs to FDs.  To be more specific, it covers
> BPF_{PROG,MAP,LINK}_GET_NEXT_ID and BPF_{PROG,MAP,LINK}_GET_FD_BY_ID.
> It should also include BPF_BTF_GET_NEXT_ID and BPF_BTF_GET_FD_BY_ID,
> but I don't implement it because I find we can do more wrt BTF, for
> example, if we can expose a small amount of BTFs in the vmlinux to
> non-root bpf namespace.
> But, yes, I should implement BTF ID in this patchset.

Right, as you can see by my comment on that patch, not including the btf
id is a tad confusing, so yeah, better include that.

>> In particular, I'm a little worried about the
>> interaction between namespaces and bpffs; what happens if you're in a
>> bpf namespace and you try to read a BPF object from a bpffs that belongs
>> to a different namespace? Does the operation fail? Is the object hidden
>> entirely? Something else?
>>
>
> bpffs is a different topic and it can be implemented in later patchsets.
> bpffs has its own specific problem even without the bpf namespace.
> 1. The user can always get the information of a bpf object through its
> corresponding pinned file.
> In our practice, different container users have different bpffs, and
> we allow the container user to bind-mount its bpffs only, so others'
> bpffs are invisible.
> To make it better with the bpf namespace, I think we can fail the
> operation if the pinned file doesn't belong to its bpf namespace. That
> said, we will add pinned bpf files into the bpf namespace in the next
> step.
>
> 2. The user can always iterate bpf objects through progs.debug and maps.debug
> progs.debug and maps.debug are debugging purposes only. So I think we
> can handle it later.

Well, I disagree. Working out these issues with bpffs is an important
aspect to get a consistent API, and handwaving it away risks merging
something that will turn out to not be workable further down the line at
which point we can't change it.

-Toke

Yafang Shao March 28, 2023, 3:42 a.m. UTC | #6

On Tue, Mar 28, 2023 at 1:28 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 03/26, Yafang Shao wrote:
> > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > to FDs, that's intended for BPF's security model[1]. Not only does it
> > prevent non-privilidged users from getting other users' bpf program, but
> > also it prevents the user from iterating his own bpf objects.
>
> > In container environment, some users want to run bpf programs in their
> > containers. These users can run their bpf programs under CAP_BPF and
> > some other specific CAPs, but they can't inspect their bpf programs in a
> > generic way. For example, the bpftool can't be used as it requires
> > CAP_SYS_ADMIN. That is very inconvenient.
>
> > Without CAP_SYS_ADMIN, the only way to get the information of a bpf object
> > which is not created by the process itself is with SCM_RIGHTS, that
> > requires each processes which created bpf object has to implement a unix
> > domain socket to share the fd of a bpf object between different
> > processes, that is really trivial and troublesome.
>
> > Hence we need a better mechanism to get bpf object info without
> > CAP_SYS_ADMIN.
>
> [..]
>
> > BPF namespace is introduced in this patchset with an attempt to remove
> > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> > link in a specific bpf namespace, then these bpf objects will not be
> > visible to the users in a different bpf namespace. But these bpf
> > objects are visible to its parent bpf namespace, so the sys admin can
> > still iterate and inspect them.
>
> Does it essentially mean unpriv bpf?

Right. With CAP_BPF and some other CAPs enabled.

> Can I, as a non-root, create
> a new bpf namespace and start loading/attaching progs?

No, you can't create a new bpf namespace as a non-root, see also
copy_namespaces().
In the container environment, new namespaces are always created by
containered, which is started by root.

> Maybe add a paragraph about now vs whatever you're proposing.

What I'm proposing in this patchset is to put bpf objects (map, prog,
link, and btf) into the bpf namespace. Next step I will put bpffs into
the bpf namespace as well.
That said, I'm trying to put  all the objects created in bpf into the
bpf namespace. Below is a simple paragraph to illustrate it.

Regarding the unpriv user with CAP_BPF enabled,
                                                             Now | Future
------------------------------------------------------------------------
Iterate his BPF IDs                                | N   |  Y  |
Iterate others' BPF IDs                          | N   |  N  |
Convert his BPF IDs to FDs                  | N   |  Y  |
Convert others' BPF IDs to FDs            | N   |  N  |
Get others' object info from pinned file  | Y(*) | N  |
------------------------------------------------------------------------

(*) It can be improved by,
     1). Different containers has different bpffs
     2). Setting file permission
     That's not perfect, for example, if one single user has two bpf
instances, but we don't want them to inspect each other.

> Otherwise it's not very clear what's the security story.
> (haven't looked at the whole series, so maybe it's answered somewhere else?)
>

Yafang Shao March 28, 2023, 3:47 a.m. UTC | #7

On Tue, Mar 28, 2023 at 3:04 AM Song Liu <song@kernel.org> wrote:
>
> On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > to FDs, that's intended for BPF's security model[1]. Not only does it
> > prevent non-privilidged users from getting other users' bpf program, but
> > also it prevents the user from iterating his own bpf objects.
> >
> > In container environment, some users want to run bpf programs in their
> > containers. These users can run their bpf programs under CAP_BPF and
> > some other specific CAPs, but they can't inspect their bpf programs in a
> > generic way. For example, the bpftool can't be used as it requires
> > CAP_SYS_ADMIN. That is very inconvenient.
>
> Agreed that it is important to enable tools like bpftool without
> CAP_SYS_ADMIN. However, I am not sure whether we need a new
> namespace for this. Can we reuse some existing namespace for this?
>

It seems we can't.

> If we do need a new namespace, maybe we should share some effort
> with tracer namespace proposal [1]?
>

Thanks for your information. I will learn the tracer namespace first
and try to analyze how to cooperate with it.

> Thanks,
> Song
>
> [1] https://lpc.events/event/16/contributions/1237/

Yafang Shao March 28, 2023, 3:48 a.m. UTC | #8

On Tue, Mar 28, 2023 at 4:51 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Sun, Mar 26, 2023 at 6:49 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> >> > to FDs, that's intended for BPF's security model[1]. Not only does it
> >> > prevent non-privilidged users from getting other users' bpf program, but
> >> > also it prevents the user from iterating his own bpf objects.
> >> >
> >> > In container environment, some users want to run bpf programs in their
> >> > containers. These users can run their bpf programs under CAP_BPF and
> >> > some other specific CAPs, but they can't inspect their bpf programs in a
> >> > generic way. For example, the bpftool can't be used as it requires
> >> > CAP_SYS_ADMIN. That is very inconvenient.
> >> >
> >> > Without CAP_SYS_ADMIN, the only way to get the information of a bpf object
> >> > which is not created by the process itself is with SCM_RIGHTS, that
> >> > requires each processes which created bpf object has to implement a unix
> >> > domain socket to share the fd of a bpf object between different
> >> > processes, that is really trivial and troublesome.
> >> >
> >> > Hence we need a better mechanism to get bpf object info without
> >> > CAP_SYS_ADMIN.
> >> >
> >> > BPF namespace is introduced in this patchset with an attempt to remove
> >> > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> >> > link in a specific bpf namespace, then these bpf objects will not be
> >> > visible to the users in a different bpf namespace. But these bpf
> >> > objects are visible to its parent bpf namespace, so the sys admin can
> >> > still iterate and inspect them.
> >> >
> >> > BPF namespace is similar to PID namespace, and the bpf objects are
> >> > similar to tasks, so BPF namespace is very easy to understand. These
> >> > patchset only implements BPF namespace for bpf map, prog and link. In the
> >> > future we may extend it to other bpf objects like btf, bpffs and etc.
> >>
> >> May? I think we should cover all of the existing BPF objects from the
> >> beginning here, or we may miss important interactions that will
> >> invalidate the whole idea.
> >
> > This patchset is intended to address iterating bpf IDs and converting
> > IDs to FDs.  To be more specific, it covers
> > BPF_{PROG,MAP,LINK}_GET_NEXT_ID and BPF_{PROG,MAP,LINK}_GET_FD_BY_ID.
> > It should also include BPF_BTF_GET_NEXT_ID and BPF_BTF_GET_FD_BY_ID,
> > but I don't implement it because I find we can do more wrt BTF, for
> > example, if we can expose a small amount of BTFs in the vmlinux to
> > non-root bpf namespace.
> > But, yes, I should implement BTF ID in this patchset.
>
> Right, as you can see by my comment on that patch, not including the btf
> id is a tad confusing, so yeah, better include that.
>
> >> In particular, I'm a little worried about the
> >> interaction between namespaces and bpffs; what happens if you're in a
> >> bpf namespace and you try to read a BPF object from a bpffs that belongs
> >> to a different namespace? Does the operation fail? Is the object hidden
> >> entirely? Something else?
> >>
> >
> > bpffs is a different topic and it can be implemented in later patchsets.
> > bpffs has its own specific problem even without the bpf namespace.
> > 1. The user can always get the information of a bpf object through its
> > corresponding pinned file.
> > In our practice, different container users have different bpffs, and
> > we allow the container user to bind-mount its bpffs only, so others'
> > bpffs are invisible.
> > To make it better with the bpf namespace, I think we can fail the
> > operation if the pinned file doesn't belong to its bpf namespace. That
> > said, we will add pinned bpf files into the bpf namespace in the next
> > step.
> >
> > 2. The user can always iterate bpf objects through progs.debug and maps.debug
> > progs.debug and maps.debug are debugging purposes only. So I think we
> > can handle it later.
>
> Well, I disagree. Working out these issues with bpffs is an important
> aspect to get a consistent API, and handwaving it away risks merging
> something that will turn out to not be workable further down the line at
> which point we can't change it.
>

Sure, I will include bpffs in the next version.

Stanislav Fomichev March 28, 2023, 5:15 p.m. UTC | #9

On 03/28, Yafang Shao wrote:
> On Tue, Mar 28, 2023 at 1:28 AM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 03/26, Yafang Shao wrote:
> > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert  
> IDs
> > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > prevent non-privilidged users from getting other users' bpf program,  
> but
> > > also it prevents the user from iterating his own bpf objects.
> >
> > > In container environment, some users want to run bpf programs in their
> > > containers. These users can run their bpf programs under CAP_BPF and
> > > some other specific CAPs, but they can't inspect their bpf programs  
> in a
> > > generic way. For example, the bpftool can't be used as it requires
> > > CAP_SYS_ADMIN. That is very inconvenient.
> >
> > > Without CAP_SYS_ADMIN, the only way to get the information of a bpf  
> object
> > > which is not created by the process itself is with SCM_RIGHTS, that
> > > requires each processes which created bpf object has to implement a  
> unix
> > > domain socket to share the fd of a bpf object between different
> > > processes, that is really trivial and troublesome.
> >
> > > Hence we need a better mechanism to get bpf object info without
> > > CAP_SYS_ADMIN.
> >
> > [..]
> >
> > > BPF namespace is introduced in this patchset with an attempt to remove
> > > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> > > link in a specific bpf namespace, then these bpf objects will not be
> > > visible to the users in a different bpf namespace. But these bpf
> > > objects are visible to its parent bpf namespace, so the sys admin can
> > > still iterate and inspect them.
> >
> > Does it essentially mean unpriv bpf?

> Right. With CAP_BPF and some other CAPs enabled.

> > Can I, as a non-root, create
> > a new bpf namespace and start loading/attaching progs?

> No, you can't create a new bpf namespace as a non-root, see also
> copy_namespaces().
> In the container environment, new namespaces are always created by
> containered, which is started by root.

Are you talking about "if (!ns_capable(user_ns, CAP_SYS_ADMIN))" part
from copy_namespaces? Isn't it trivially bypassed with a new user
namespace?

IIUC, I can create a new user namespace which gives me CAP_SYS_ADMIN
in this particular user-ns. Then I can go on and create a new bpf
namespace (with CAP_BPF) and go wild? I won't see anything from the
other namespaces, but I'll be able to load/attach bpf programs?

> > Maybe add a paragraph about now vs whatever you're proposing.

> What I'm proposing in this patchset is to put bpf objects (map, prog,
> link, and btf) into the bpf namespace. Next step I will put bpffs into
> the bpf namespace as well.
> That said, I'm trying to put  all the objects created in bpf into the
> bpf namespace. Below is a simple paragraph to illustrate it.

> Regarding the unpriv user with CAP_BPF enabled,
>                                                               Now | Future
> ------------------------------------------------------------------------
> Iterate his BPF IDs                                | N   |  Y  |
> Iterate others' BPF IDs                          | N   |  N  |
> Convert his BPF IDs to FDs                  | N   |  Y  |
> Convert others' BPF IDs to FDs            | N   |  N  |
> Get others' object info from pinned file  | Y(*) | N  |
> ------------------------------------------------------------------------

> (*) It can be improved by,
>       1). Different containers has different bpffs
>       2). Setting file permission
>       That's not perfect, for example, if one single user has two bpf
> instances, but we don't want them to inspect each other.

I think the question here is what happens to the existing
capable(CAP_BPF) checks? Do they become ns_capable(CAP_BPF) eventually?

And if not, I don't think it integrates well with the user namespaces?

> > Otherwise it's not very clear what's the security story.
> > (haven't looked at the whole series, so maybe it's answered somewhere  
> else?)
> >


> --
> Regards
> Yafang

Yafang Shao March 29, 2023, 3:02 a.m. UTC | #10

On Wed, Mar 29, 2023 at 1:15 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 03/28, Yafang Shao wrote:
> > On Tue, Mar 28, 2023 at 1:28 AM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On 03/26, Yafang Shao wrote:
> > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert
> > IDs
> > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > prevent non-privilidged users from getting other users' bpf program,
> > but
> > > > also it prevents the user from iterating his own bpf objects.
> > >
> > > > In container environment, some users want to run bpf programs in their
> > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > some other specific CAPs, but they can't inspect their bpf programs
> > in a
> > > > generic way. For example, the bpftool can't be used as it requires
> > > > CAP_SYS_ADMIN. That is very inconvenient.
> > >
> > > > Without CAP_SYS_ADMIN, the only way to get the information of a bpf
> > object
> > > > which is not created by the process itself is with SCM_RIGHTS, that
> > > > requires each processes which created bpf object has to implement a
> > unix
> > > > domain socket to share the fd of a bpf object between different
> > > > processes, that is really trivial and troublesome.
> > >
> > > > Hence we need a better mechanism to get bpf object info without
> > > > CAP_SYS_ADMIN.
> > >
> > > [..]
> > >
> > > > BPF namespace is introduced in this patchset with an attempt to remove
> > > > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> > > > link in a specific bpf namespace, then these bpf objects will not be
> > > > visible to the users in a different bpf namespace. But these bpf
> > > > objects are visible to its parent bpf namespace, so the sys admin can
> > > > still iterate and inspect them.
> > >
> > > Does it essentially mean unpriv bpf?
>
> > Right. With CAP_BPF and some other CAPs enabled.
>
> > > Can I, as a non-root, create
> > > a new bpf namespace and start loading/attaching progs?
>
> > No, you can't create a new bpf namespace as a non-root, see also
> > copy_namespaces().
> > In the container environment, new namespaces are always created by
> > containered, which is started by root.
>
> Are you talking about "if (!ns_capable(user_ns, CAP_SYS_ADMIN))" part
> from copy_namespaces? Isn't it trivially bypassed with a new user
> namespace?
>
> IIUC, I can create a new user namespace which gives me CAP_SYS_ADMIN
> in this particular user-ns. Then I can go on and create a new bpf
> namespace (with CAP_BPF) and go wild? I won't see anything from the
> other namespaces, but I'll be able to load/attach bpf programs?
>

I don't think so. If you create a new userspace, and give the process
the CAP_BPF or CAP_SYS_ADMIN in this new user namespace but not the
initial namespace, you can't do that. Because currently only CAP_BPF
or CAP_SYS_ADMIN in the init user namespace can load/attach bpf
programs.

> > > Maybe add a paragraph about now vs whatever you're proposing.
>
> > What I'm proposing in this patchset is to put bpf objects (map, prog,
> > link, and btf) into the bpf namespace. Next step I will put bpffs into
> > the bpf namespace as well.
> > That said, I'm trying to put  all the objects created in bpf into the
> > bpf namespace. Below is a simple paragraph to illustrate it.
>
> > Regarding the unpriv user with CAP_BPF enabled,
> >                                                               Now | Future
> > ------------------------------------------------------------------------
> > Iterate his BPF IDs                                | N   |  Y  |
> > Iterate others' BPF IDs                          | N   |  N  |
> > Convert his BPF IDs to FDs                  | N   |  Y  |
> > Convert others' BPF IDs to FDs            | N   |  N  |
> > Get others' object info from pinned file  | Y(*) | N  |
> > ------------------------------------------------------------------------
>
> > (*) It can be improved by,
> >       1). Different containers has different bpffs
> >       2). Setting file permission
> >       That's not perfect, for example, if one single user has two bpf
> > instances, but we don't want them to inspect each other.
>
> I think the question here is what happens to the existing
> capable(CAP_BPF) checks? Do they become ns_capable(CAP_BPF) eventually?
>

They won't become ns_capable(CAP_BPF). If it becomes
ns_capable(CAP_BPF), it will really go wild then.

> And if not, I don't think it integrates well with the user namespaces?
>

IIUC, it is the CAP_BPF which doesn't integrate with the user
namespaces, right?

Stanislav Fomichev March 29, 2023, 8:50 p.m. UTC | #11

On Tue, Mar 28, 2023 at 8:03 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Wed, Mar 29, 2023 at 1:15 AM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 03/28, Yafang Shao wrote:
> > > On Tue, Mar 28, 2023 at 1:28 AM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > On 03/26, Yafang Shao wrote:
> > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert
> > > IDs
> > > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > > prevent non-privilidged users from getting other users' bpf program,
> > > but
> > > > > also it prevents the user from iterating his own bpf objects.
> > > >
> > > > > In container environment, some users want to run bpf programs in their
> > > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > > some other specific CAPs, but they can't inspect their bpf programs
> > > in a
> > > > > generic way. For example, the bpftool can't be used as it requires
> > > > > CAP_SYS_ADMIN. That is very inconvenient.
> > > >
> > > > > Without CAP_SYS_ADMIN, the only way to get the information of a bpf
> > > object
> > > > > which is not created by the process itself is with SCM_RIGHTS, that
> > > > > requires each processes which created bpf object has to implement a
> > > unix
> > > > > domain socket to share the fd of a bpf object between different
> > > > > processes, that is really trivial and troublesome.
> > > >
> > > > > Hence we need a better mechanism to get bpf object info without
> > > > > CAP_SYS_ADMIN.
> > > >
> > > > [..]
> > > >
> > > > > BPF namespace is introduced in this patchset with an attempt to remove
> > > > > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> > > > > link in a specific bpf namespace, then these bpf objects will not be
> > > > > visible to the users in a different bpf namespace. But these bpf
> > > > > objects are visible to its parent bpf namespace, so the sys admin can
> > > > > still iterate and inspect them.
> > > >
> > > > Does it essentially mean unpriv bpf?
> >
> > > Right. With CAP_BPF and some other CAPs enabled.
> >
> > > > Can I, as a non-root, create
> > > > a new bpf namespace and start loading/attaching progs?
> >
> > > No, you can't create a new bpf namespace as a non-root, see also
> > > copy_namespaces().
> > > In the container environment, new namespaces are always created by
> > > containered, which is started by root.
> >
> > Are you talking about "if (!ns_capable(user_ns, CAP_SYS_ADMIN))" part
> > from copy_namespaces? Isn't it trivially bypassed with a new user
> > namespace?
> >
> > IIUC, I can create a new user namespace which gives me CAP_SYS_ADMIN
> > in this particular user-ns. Then I can go on and create a new bpf
> > namespace (with CAP_BPF) and go wild? I won't see anything from the
> > other namespaces, but I'll be able to load/attach bpf programs?
> >
>
> I don't think so. If you create a new userspace, and give the process
> the CAP_BPF or CAP_SYS_ADMIN in this new user namespace but not the
> initial namespace, you can't do that. Because currently only CAP_BPF
> or CAP_SYS_ADMIN in the init user namespace can load/attach bpf
> programs.
>
> > > > Maybe add a paragraph about now vs whatever you're proposing.
> >
> > > What I'm proposing in this patchset is to put bpf objects (map, prog,
> > > link, and btf) into the bpf namespace. Next step I will put bpffs into
> > > the bpf namespace as well.
> > > That said, I'm trying to put  all the objects created in bpf into the
> > > bpf namespace. Below is a simple paragraph to illustrate it.
> >
> > > Regarding the unpriv user with CAP_BPF enabled,
> > >                                                               Now | Future
> > > ------------------------------------------------------------------------
> > > Iterate his BPF IDs                                | N   |  Y  |
> > > Iterate others' BPF IDs                          | N   |  N  |
> > > Convert his BPF IDs to FDs                  | N   |  Y  |
> > > Convert others' BPF IDs to FDs            | N   |  N  |
> > > Get others' object info from pinned file  | Y(*) | N  |
> > > ------------------------------------------------------------------------
> >
> > > (*) It can be improved by,
> > >       1). Different containers has different bpffs
> > >       2). Setting file permission
> > >       That's not perfect, for example, if one single user has two bpf
> > > instances, but we don't want them to inspect each other.
> >
> > I think the question here is what happens to the existing
> > capable(CAP_BPF) checks? Do they become ns_capable(CAP_BPF) eventually?
> >
>
> They won't become ns_capable(CAP_BPF). If it becomes
> ns_capable(CAP_BPF), it will really go wild then.
>
> > And if not, I don't think it integrates well with the user namespaces?
> >
>
> IIUC, it is the CAP_BPF which doesn't integrate with the user
> namespaces, right?

Yeah. And it's probably fine if we don't, I just wanted to see some
explanation on the long-term plan.
If the purpose is to have a bpf namespace and use it for pure
isolation purposes, let's state it clearly in the cover letter.
Otherwise it's not clear whether it's only about isolation or
potentially allowing CAP_BPF in user namespaces.
Maybe clone(CLONE_NEWBPF|CLONE_NEWUSER) should return an explicit
error? (or maybe it already does, haven't looked at the patches)

One other question I have is: should init bpf namespace see
everything? Otherwise it might be hard to chase down the namespaces
that loaded some BPF program...




> --
> Regards
> Yafang

Yafang Shao March 30, 2023, 2:40 a.m. UTC | #12

On Thu, Mar 30, 2023 at 4:50 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Tue, Mar 28, 2023 at 8:03 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Wed, Mar 29, 2023 at 1:15 AM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On 03/28, Yafang Shao wrote:
> > > > On Tue, Mar 28, 2023 at 1:28 AM Stanislav Fomichev <sdf@google.com> wrote:
> > > > >
> > > > > On 03/26, Yafang Shao wrote:
> > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert
> > > > IDs
> > > > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > > > prevent non-privilidged users from getting other users' bpf program,
> > > > but
> > > > > > also it prevents the user from iterating his own bpf objects.
> > > > >
> > > > > > In container environment, some users want to run bpf programs in their
> > > > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > > > some other specific CAPs, but they can't inspect their bpf programs
> > > > in a
> > > > > > generic way. For example, the bpftool can't be used as it requires
> > > > > > CAP_SYS_ADMIN. That is very inconvenient.
> > > > >
> > > > > > Without CAP_SYS_ADMIN, the only way to get the information of a bpf
> > > > object
> > > > > > which is not created by the process itself is with SCM_RIGHTS, that
> > > > > > requires each processes which created bpf object has to implement a
> > > > unix
> > > > > > domain socket to share the fd of a bpf object between different
> > > > > > processes, that is really trivial and troublesome.
> > > > >
> > > > > > Hence we need a better mechanism to get bpf object info without
> > > > > > CAP_SYS_ADMIN.
> > > > >
> > > > > [..]
> > > > >
> > > > > > BPF namespace is introduced in this patchset with an attempt to remove
> > > > > > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> > > > > > link in a specific bpf namespace, then these bpf objects will not be
> > > > > > visible to the users in a different bpf namespace. But these bpf
> > > > > > objects are visible to its parent bpf namespace, so the sys admin can
> > > > > > still iterate and inspect them.
> > > > >
> > > > > Does it essentially mean unpriv bpf?
> > >
> > > > Right. With CAP_BPF and some other CAPs enabled.
> > >
> > > > > Can I, as a non-root, create
> > > > > a new bpf namespace and start loading/attaching progs?
> > >
> > > > No, you can't create a new bpf namespace as a non-root, see also
> > > > copy_namespaces().
> > > > In the container environment, new namespaces are always created by
> > > > containered, which is started by root.
> > >
> > > Are you talking about "if (!ns_capable(user_ns, CAP_SYS_ADMIN))" part
> > > from copy_namespaces? Isn't it trivially bypassed with a new user
> > > namespace?
> > >
> > > IIUC, I can create a new user namespace which gives me CAP_SYS_ADMIN
> > > in this particular user-ns. Then I can go on and create a new bpf
> > > namespace (with CAP_BPF) and go wild? I won't see anything from the
> > > other namespaces, but I'll be able to load/attach bpf programs?
> > >
> >
> > I don't think so. If you create a new userspace, and give the process
> > the CAP_BPF or CAP_SYS_ADMIN in this new user namespace but not the
> > initial namespace, you can't do that. Because currently only CAP_BPF
> > or CAP_SYS_ADMIN in the init user namespace can load/attach bpf
> > programs.
> >
> > > > > Maybe add a paragraph about now vs whatever you're proposing.
> > >
> > > > What I'm proposing in this patchset is to put bpf objects (map, prog,
> > > > link, and btf) into the bpf namespace. Next step I will put bpffs into
> > > > the bpf namespace as well.
> > > > That said, I'm trying to put  all the objects created in bpf into the
> > > > bpf namespace. Below is a simple paragraph to illustrate it.
> > >
> > > > Regarding the unpriv user with CAP_BPF enabled,
> > > >                                                               Now | Future
> > > > ------------------------------------------------------------------------
> > > > Iterate his BPF IDs                                | N   |  Y  |
> > > > Iterate others' BPF IDs                          | N   |  N  |
> > > > Convert his BPF IDs to FDs                  | N   |  Y  |
> > > > Convert others' BPF IDs to FDs            | N   |  N  |
> > > > Get others' object info from pinned file  | Y(*) | N  |
> > > > ------------------------------------------------------------------------
> > >
> > > > (*) It can be improved by,
> > > >       1). Different containers has different bpffs
> > > >       2). Setting file permission
> > > >       That's not perfect, for example, if one single user has two bpf
> > > > instances, but we don't want them to inspect each other.
> > >
> > > I think the question here is what happens to the existing
> > > capable(CAP_BPF) checks? Do they become ns_capable(CAP_BPF) eventually?
> > >
> >
> > They won't become ns_capable(CAP_BPF). If it becomes
> > ns_capable(CAP_BPF), it will really go wild then.
> >
> > > And if not, I don't think it integrates well with the user namespaces?
> > >
> >
> > IIUC, it is the CAP_BPF which doesn't integrate with the user
> > namespaces, right?
>
> Yeah. And it's probably fine if we don't, I just wanted to see some
> explanation on the long-term plan.
> If the purpose is to have a bpf namespace and use it for pure
> isolation purposes, let's state it clearly in the cover letter.

Will add it.

> Otherwise it's not clear whether it's only about isolation or
> potentially allowing CAP_BPF in user namespaces.
> Maybe clone(CLONE_NEWBPF|CLONE_NEWUSER) should return an explicit
> error? (or maybe it already does, haven't looked at the patches)
>

Good suggestion. It should return an error.  I will change it in the
next version.

> One other question I have is: should init bpf namespace see
> everything? Otherwise it might be hard to chase down the namespaces
> that loaded some BPF program...

Right, the init bpf namespace will see everything.

Hao Luo March 31, 2023, 5:52 a.m. UTC | #13

On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
<...>
>
> BPF namespace is introduced in this patchset with an attempt to remove
> the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> link in a specific bpf namespace, then these bpf objects will not be
> visible to the users in a different bpf namespace. But these bpf
> objects are visible to its parent bpf namespace, so the sys admin can
> still iterate and inspect them.
>
> BPF namespace is similar to PID namespace, and the bpf objects are
> similar to tasks, so BPF namespace is very easy to understand. These
> patchset only implements BPF namespace for bpf map, prog and link. In the
> future we may extend it to other bpf objects like btf, bpffs and etc.
> For example, we can allow some of the BTF objects to be used in
> non-init bpf namespace, then the container user can only trace the
> processes running in his container, but can't get the information of
> tasks running in other containers.
>

Hi Yafang,

Thanks for putting effort toward enabling BPF for container users!

However, I think the cover letter can be improved. It's unclear to me
what exactly is BPF namespace, what exactly it tries to achieve and
what is its behavior. If you look at the manpage of pid namespace [1],
cgroup namespace[2], and namespace[3], they all have a very precise
definition, their goals and explain the intended behaviors well.

I felt you intended the BPF namespace to provide isolation of object
ids. That is, different views of the bpf object ids for different
processes. This is like the PID namespace. But somehow, you also
attach CAPs on top of that. That, I think, is not a namespace's job.

Well, I could be wrong, but would appreciate you adding more details
as follow-up.

Hao

[1] https://man7.org/linux/man-pages/man7/pid_namespaces.7.html
[2] https://man7.org/linux/man-pages/man7/cgroup_namespaces.7.html
[3] https://man7.org/linux/man-pages/man7/namespaces.7.html

Yafang Shao April 1, 2023, 4:32 p.m. UTC | #14

On Fri, Mar 31, 2023 at 1:52 PM Hao Luo <haoluo@google.com> wrote:
>
> On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> <...>
> >
> > BPF namespace is introduced in this patchset with an attempt to remove
> > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> > link in a specific bpf namespace, then these bpf objects will not be
> > visible to the users in a different bpf namespace. But these bpf
> > objects are visible to its parent bpf namespace, so the sys admin can
> > still iterate and inspect them.
> >
> > BPF namespace is similar to PID namespace, and the bpf objects are
> > similar to tasks, so BPF namespace is very easy to understand. These
> > patchset only implements BPF namespace for bpf map, prog and link. In the
> > future we may extend it to other bpf objects like btf, bpffs and etc.
> > For example, we can allow some of the BTF objects to be used in
> > non-init bpf namespace, then the container user can only trace the
> > processes running in his container, but can't get the information of
> > tasks running in other containers.
> >
>
> Hi Yafang,
>
> Thanks for putting effort toward enabling BPF for container users!
>
> However, I think the cover letter can be improved. It's unclear to me
> what exactly is BPF namespace, what exactly it tries to achieve and
> what is its behavior. If you look at the manpage of pid namespace [1],
> cgroup namespace[2], and namespace[3], they all have a very precise
> definition, their goals and explain the intended behaviors well.
>

Thanks for your suggestion. The covetter should be improved. I will
read the man pages of these namespaces and improve it as you
suggested.

> I felt you intended the BPF namespace to provide isolation of object
> ids. That is, different views of the bpf object ids for different
> processes. This is like the PID namespace. But somehow, you also
> attach CAPs on top of that. That, I think, is not a namespace's job.
>

Agree with you that it should be independent of CAPs.
After the bpf namespace is introduced, actually we don't need the CAPs
when the user iterates IDs or converts IDs to FDs in his bpf namespace
(except in the init bpf namespace), because these are all readonly
operations and the user can only read the bpf objects created by
himself. While the CAPs should be required when the user wants to
write something, e.g. creating a map, loading a prog. They are really
different things. I will change it in the next version.

> Well, I could be wrong, but would appreciate you adding more details
> as follow-up.
>
> Hao
>
> [1] https://man7.org/linux/man-pages/man7/pid_namespaces.7.html
> [2] https://man7.org/linux/man-pages/man7/cgroup_namespaces.7.html
> [3] https://man7.org/linux/man-pages/man7/namespaces.7.html

Alexei Starovoitov April 2, 2023, 11:37 p.m. UTC | #15

On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote:
> On Tue, Mar 28, 2023 at 3:04 AM Song Liu <song@kernel.org> wrote:
> >
> > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > prevent non-privilidged users from getting other users' bpf program, but
> > > also it prevents the user from iterating his own bpf objects.
> > >
> > > In container environment, some users want to run bpf programs in their
> > > containers. These users can run their bpf programs under CAP_BPF and
> > > some other specific CAPs, but they can't inspect their bpf programs in a
> > > generic way. For example, the bpftool can't be used as it requires
> > > CAP_SYS_ADMIN. That is very inconvenient.
> >
> > Agreed that it is important to enable tools like bpftool without
> > CAP_SYS_ADMIN. However, I am not sure whether we need a new
> > namespace for this. Can we reuse some existing namespace for this?
> >
> 
> It seems we can't.

Yafang,

It's a Nack.

The only thing you've been trying to "solve" with bpf namespace is to
allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN.
The concept of bpf namespace is not even close to be thought through.
Others pointed out the gaps in the design. Like bpffs. There are plenty.
Please do not send patches like this in the future.
You need to start with describing the problem you want to solve,
then propose _several_ solutions, describe their pros and cons,
solicit feedback, present at the conferences (like LSFMMBPF or LPC),
and when the community agrees that 1. problem is worth solving,
2. the solution makes sense, only then work on patches.

"In container environment, some users want to run bpf programs in their containers."
is something Song brought up at LSFMMBPF a year ago.
At that meeting most of the folks agreed that there is a need to run bpf
in containers and make sure that the effect of bpf prog is limited to a container.
This new namespace that creates virtual IDs for progs and maps doesn't come
close in solving this task.

Yafang Shao April 3, 2023, 3:05 a.m. UTC | #16

On Mon, Apr 3, 2023 at 7:37 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote:
> > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <song@kernel.org> wrote:
> > >
> > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > prevent non-privilidged users from getting other users' bpf program, but
> > > > also it prevents the user from iterating his own bpf objects.
> > > >
> > > > In container environment, some users want to run bpf programs in their
> > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > some other specific CAPs, but they can't inspect their bpf programs in a
> > > > generic way. For example, the bpftool can't be used as it requires
> > > > CAP_SYS_ADMIN. That is very inconvenient.
> > >
> > > Agreed that it is important to enable tools like bpftool without
> > > CAP_SYS_ADMIN. However, I am not sure whether we need a new
> > > namespace for this. Can we reuse some existing namespace for this?
> > >
> >
> > It seems we can't.
>
> Yafang,
>
> It's a Nack.
>
> The only thing you've been trying to "solve" with bpf namespace is to
> allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN.
> The concept of bpf namespace is not even close to be thought through.

Right, it is more likely a PoC in its current state.

> Others pointed out the gaps in the design. Like bpffs. There are plenty.
> Please do not send patches like this in the future.

The reason I sent it with an early state is that I want to get some
early feedback from the community ahead of the LSF/MM/BPF workshop,
then I can improve it based on these feedbacks and present it more
specifically at the workshop. Then the discussion will be more
effective.

> You need to start with describing the problem you want to solve,
> then propose _several_ solutions, describe their pros and cons,
> solicit feedback, present at the conferences (like LSFMMBPF or LPC),
> and when the community agrees that 1. problem is worth solving,
> 2. the solution makes sense, only then work on patches.
>

I would like to give a short discussion on the BPF namespace if
everything goes fine.

> "In container environment, some users want to run bpf programs in their containers."
> is something Song brought up at LSFMMBPF a year ago.
> At that meeting most of the folks agreed that there is a need to run bpf
> in containers and make sure that the effect of bpf prog is limited to a container.
> This new namespace that creates virtual IDs for progs and maps doesn't come
> close in solving this task.

Currently in our production environment, all the containers running
bpf programs are privileged, that is risky. So actually the goal of
the BPF namespace is to make them (or part of them) non-privileged.
But some of the abilities of these bpf programs will be lost in this
procedure, like the debug-bility with bpftool, so we need to fix it.
Agree with you that this goal is far from making bpf programs safely
running in a container environment.

Alexei Starovoitov April 3, 2023, 10:50 p.m. UTC | #17

On Mon, Apr 03, 2023 at 11:05:25AM +0800, Yafang Shao wrote:
> On Mon, Apr 3, 2023 at 7:37 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote:
> > > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <song@kernel.org> wrote:
> > > >
> > > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > > prevent non-privilidged users from getting other users' bpf program, but
> > > > > also it prevents the user from iterating his own bpf objects.
> > > > >
> > > > > In container environment, some users want to run bpf programs in their
> > > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > > some other specific CAPs, but they can't inspect their bpf programs in a
> > > > > generic way. For example, the bpftool can't be used as it requires
> > > > > CAP_SYS_ADMIN. That is very inconvenient.
> > > >
> > > > Agreed that it is important to enable tools like bpftool without
> > > > CAP_SYS_ADMIN. However, I am not sure whether we need a new
> > > > namespace for this. Can we reuse some existing namespace for this?
> > > >
> > >
> > > It seems we can't.
> >
> > Yafang,
> >
> > It's a Nack.
> >
> > The only thing you've been trying to "solve" with bpf namespace is to
> > allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN.
> > The concept of bpf namespace is not even close to be thought through.
> 
> Right, it is more likely a PoC in its current state.
> 
> > Others pointed out the gaps in the design. Like bpffs. There are plenty.
> > Please do not send patches like this in the future.
> 
> The reason I sent it with an early state is that I want to get some
> early feedback from the community ahead of the LSF/MM/BPF workshop,
> then I can improve it based on these feedbacks and present it more
> specifically at the workshop. Then the discussion will be more
> effective.
> 
> > You need to start with describing the problem you want to solve,
> > then propose _several_ solutions, describe their pros and cons,
> > solicit feedback, present at the conferences (like LSFMMBPF or LPC),
> > and when the community agrees that 1. problem is worth solving,
> > 2. the solution makes sense, only then work on patches.
> >
> 
> I would like to give a short discussion on the BPF namespace if
> everything goes fine.

Not in this shape of BPF namespace as done in this patch set.
We've talked about BPF namespace in the past. This is not it.

> > "In container environment, some users want to run bpf programs in their containers."
> > is something Song brought up at LSFMMBPF a year ago.
> > At that meeting most of the folks agreed that there is a need to run bpf
> > in containers and make sure that the effect of bpf prog is limited to a container.
> > This new namespace that creates virtual IDs for progs and maps doesn't come
> > close in solving this task.
> 
> Currently in our production environment, all the containers running
> bpf programs are privileged, that is risky. So actually the goal of
> the BPF namespace is to make them (or part of them) non-privileged.
> But some of the abilities of these bpf programs will be lost in this
> procedure, like the debug-bility with bpftool, so we need to fix it.
> Agree with you that this goal is far from making bpf programs safely
> running in a container environment.

I disagree that allowing admin to run bpftool without sudo is a task
worth solving. The visibility of bpf progs in a container is a different task.
Without doing any kernel changes we can add a flag to bpftool to let
'bpftool prog show' list progs that were loaded by processes in the same cgroup.
bpftool already does prog->pid mapping with bpf iterators.
It can filter by cgroup just as well.

Yafang Shao April 4, 2023, 2:59 a.m. UTC | #18

On Tue, Apr 4, 2023 at 6:50 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Apr 03, 2023 at 11:05:25AM +0800, Yafang Shao wrote:
> > On Mon, Apr 3, 2023 at 7:37 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote:
> > > > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <song@kernel.org> wrote:
> > > > >
> > > > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > > > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > > > prevent non-privilidged users from getting other users' bpf program, but
> > > > > > also it prevents the user from iterating his own bpf objects.
> > > > > >
> > > > > > In container environment, some users want to run bpf programs in their
> > > > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > > > some other specific CAPs, but they can't inspect their bpf programs in a
> > > > > > generic way. For example, the bpftool can't be used as it requires
> > > > > > CAP_SYS_ADMIN. That is very inconvenient.
> > > > >
> > > > > Agreed that it is important to enable tools like bpftool without
> > > > > CAP_SYS_ADMIN. However, I am not sure whether we need a new
> > > > > namespace for this. Can we reuse some existing namespace for this?
> > > > >
> > > >
> > > > It seems we can't.
> > >
> > > Yafang,
> > >
> > > It's a Nack.
> > >
> > > The only thing you've been trying to "solve" with bpf namespace is to
> > > allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN.
> > > The concept of bpf namespace is not even close to be thought through.
> >
> > Right, it is more likely a PoC in its current state.
> >
> > > Others pointed out the gaps in the design. Like bpffs. There are plenty.
> > > Please do not send patches like this in the future.
> >
> > The reason I sent it with an early state is that I want to get some
> > early feedback from the community ahead of the LSF/MM/BPF workshop,
> > then I can improve it based on these feedbacks and present it more
> > specifically at the workshop. Then the discussion will be more
> > effective.
> >
> > > You need to start with describing the problem you want to solve,
> > > then propose _several_ solutions, describe their pros and cons,
> > > solicit feedback, present at the conferences (like LSFMMBPF or LPC),
> > > and when the community agrees that 1. problem is worth solving,
> > > 2. the solution makes sense, only then work on patches.
> > >
> >
> > I would like to give a short discussion on the BPF namespace if
> > everything goes fine.
>
> Not in this shape of BPF namespace as done in this patch set.
> We've talked about BPF namespace in the past. This is not it.
>
> > > "In container environment, some users want to run bpf programs in their containers."
> > > is something Song brought up at LSFMMBPF a year ago.
> > > At that meeting most of the folks agreed that there is a need to run bpf
> > > in containers and make sure that the effect of bpf prog is limited to a container.
> > > This new namespace that creates virtual IDs for progs and maps doesn't come
> > > close in solving this task.
> >
> > Currently in our production environment, all the containers running
> > bpf programs are privileged, that is risky. So actually the goal of
> > the BPF namespace is to make them (or part of them) non-privileged.
> > But some of the abilities of these bpf programs will be lost in this
> > procedure, like the debug-bility with bpftool, so we need to fix it.
> > Agree with you that this goal is far from making bpf programs safely
> > running in a container environment.
>
> I disagree that allowing admin to run bpftool without sudo is a task
> worth solving. The visibility of bpf progs in a container is a different task.
> Without doing any kernel changes we can add a flag to bpftool to let
> 'bpftool prog show' list progs that were loaded by processes in the same cgroup.
> bpftool already does prog->pid mapping with bpf iterators.
> It can filter by cgroup just as well.

IIUC, at least we need bellow change in the kernel,

--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3705,9 +3705,6 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
        if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX)
                return -EINVAL;

-       if (!capable(CAP_SYS_ADMIN))
-               return -EPERM;
-
        next_id++;
        spin_lock_bh(lock);
        if (!idr_get_next(idr, &next_id))

Because the container doesn't have CAP_SYS_ADMIN enabled, while they
only have CAP_BPF and other required CAPs.

Another possible solution is that we run an agent in the host, and the
user in the container who wants to get the bpf objects info in his
container should send a request to this agent via unix domain socket.
That is what we are doing now in our production environment.  That
said, each container has to run a client to get the bpf object fd.
There are some downsides,
-  It can't handle pinned bpf programs
   For pinned programs, the user can get them from the pinned files
directly, so he can use bpftool in his case, only with some
complaints.
-  If the user attached the bpf prog, and then removed the pinned
file, but didn't detach it.
   That happened. But this error case can't be handled.
- There may be other corner cases that it can't fit.

There's a solution to improve it, but we also need to change the
kernel. That is, we can use the wasted space btf->name.

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index b7e5a55..59d73a3 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -5542,6 +5542,8 @@ static struct btf *btf_parse(bpfptr_t btf_data,
u32 btf_data_size,
                err = -ENOMEM;
                goto errout;
        }
+       snprintf(btf->name, sizeof(btf->name), "%s-%d-%d", current->comm,
+                        current->pid, cgroup_id(task_cgroup(p, cpu_cgrp_id)));
        env->btf = btf;

        data = kvmalloc(btf_data_size, GFP_KERNEL | __GFP_NOWARN);

Alexei Starovoitov April 6, 2023, 2:06 a.m. UTC | #19

On Tue, Apr 04, 2023 at 10:59:55AM +0800, Yafang Shao wrote:
> On Tue, Apr 4, 2023 at 6:50 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Apr 03, 2023 at 11:05:25AM +0800, Yafang Shao wrote:
> > > On Mon, Apr 3, 2023 at 7:37 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote:
> > > > > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <song@kernel.org> wrote:
> > > > > >
> > > > > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > >
> > > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > > > > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > > > > prevent non-privilidged users from getting other users' bpf program, but
> > > > > > > also it prevents the user from iterating his own bpf objects.
> > > > > > >
> > > > > > > In container environment, some users want to run bpf programs in their
> > > > > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > > > > some other specific CAPs, but they can't inspect their bpf programs in a
> > > > > > > generic way. For example, the bpftool can't be used as it requires
> > > > > > > CAP_SYS_ADMIN. That is very inconvenient.
> > > > > >
> > > > > > Agreed that it is important to enable tools like bpftool without
> > > > > > CAP_SYS_ADMIN. However, I am not sure whether we need a new
> > > > > > namespace for this. Can we reuse some existing namespace for this?
> > > > > >
> > > > >
> > > > > It seems we can't.
> > > >
> > > > Yafang,
> > > >
> > > > It's a Nack.
> > > >
> > > > The only thing you've been trying to "solve" with bpf namespace is to
> > > > allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN.
> > > > The concept of bpf namespace is not even close to be thought through.
> > >
> > > Right, it is more likely a PoC in its current state.
> > >
> > > > Others pointed out the gaps in the design. Like bpffs. There are plenty.
> > > > Please do not send patches like this in the future.
> > >
> > > The reason I sent it with an early state is that I want to get some
> > > early feedback from the community ahead of the LSF/MM/BPF workshop,
> > > then I can improve it based on these feedbacks and present it more
> > > specifically at the workshop. Then the discussion will be more
> > > effective.
> > >
> > > > You need to start with describing the problem you want to solve,
> > > > then propose _several_ solutions, describe their pros and cons,
> > > > solicit feedback, present at the conferences (like LSFMMBPF or LPC),
> > > > and when the community agrees that 1. problem is worth solving,
> > > > 2. the solution makes sense, only then work on patches.
> > > >
> > >
> > > I would like to give a short discussion on the BPF namespace if
> > > everything goes fine.
> >
> > Not in this shape of BPF namespace as done in this patch set.
> > We've talked about BPF namespace in the past. This is not it.
> >
> > > > "In container environment, some users want to run bpf programs in their containers."
> > > > is something Song brought up at LSFMMBPF a year ago.
> > > > At that meeting most of the folks agreed that there is a need to run bpf
> > > > in containers and make sure that the effect of bpf prog is limited to a container.
> > > > This new namespace that creates virtual IDs for progs and maps doesn't come
> > > > close in solving this task.
> > >
> > > Currently in our production environment, all the containers running
> > > bpf programs are privileged, that is risky. So actually the goal of
> > > the BPF namespace is to make them (or part of them) non-privileged.
> > > But some of the abilities of these bpf programs will be lost in this
> > > procedure, like the debug-bility with bpftool, so we need to fix it.
> > > Agree with you that this goal is far from making bpf programs safely
> > > running in a container environment.
> >
> > I disagree that allowing admin to run bpftool without sudo is a task
> > worth solving. The visibility of bpf progs in a container is a different task.
> > Without doing any kernel changes we can add a flag to bpftool to let
> > 'bpftool prog show' list progs that were loaded by processes in the same cgroup.
> > bpftool already does prog->pid mapping with bpf iterators.
> > It can filter by cgroup just as well.
> 
> IIUC, at least we need bellow change in the kernel,

No. The user should just 'sudo bpftool ...' instead.

> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -3705,9 +3705,6 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
>         if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX)
>                 return -EINVAL;
> 
> -       if (!capable(CAP_SYS_ADMIN))
> -               return -EPERM;
> -
>         next_id++;
>         spin_lock_bh(lock);
>         if (!idr_get_next(idr, &next_id))
> 
> Because the container doesn't have CAP_SYS_ADMIN enabled, while they
> only have CAP_BPF and other required CAPs.
> 
> Another possible solution is that we run an agent in the host, and the
> user in the container who wants to get the bpf objects info in his
> container should send a request to this agent via unix domain socket.
> That is what we are doing now in our production environment.  That
> said, each container has to run a client to get the bpf object fd.

None of such hacks are necessary. People that debug bpf setups with bpftool
can always sudo.

> There are some downsides,
> -  It can't handle pinned bpf programs
>    For pinned programs, the user can get them from the pinned files
> directly, so he can use bpftool in his case, only with some
> complaints.
> -  If the user attached the bpf prog, and then removed the pinned
> file, but didn't detach it.
>    That happened. But this error case can't be handled.
> - There may be other corner cases that it can't fit.
> 
> There's a solution to improve it, but we also need to change the
> kernel. That is, we can use the wasted space btf->name.
> 
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index b7e5a55..59d73a3 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -5542,6 +5542,8 @@ static struct btf *btf_parse(bpfptr_t btf_data,
> u32 btf_data_size,
>                 err = -ENOMEM;
>                 goto errout;
>         }
> +       snprintf(btf->name, sizeof(btf->name), "%s-%d-%d", current->comm,
> +                        current->pid, cgroup_id(task_cgroup(p, cpu_cgrp_id)));

Unnecessary.
comm, pid, cgroup can be printed by bpftool without changing the kernel.

Yafang Shao April 6, 2023, 2:54 a.m. UTC | #20

On Thu, Apr 6, 2023 at 10:07 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Apr 04, 2023 at 10:59:55AM +0800, Yafang Shao wrote:
> > On Tue, Apr 4, 2023 at 6:50 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Apr 03, 2023 at 11:05:25AM +0800, Yafang Shao wrote:
> > > > On Mon, Apr 3, 2023 at 7:37 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Tue, Mar 28, 2023 at 11:47:31AM +0800, Yafang Shao wrote:
> > > > > > On Tue, Mar 28, 2023 at 3:04 AM Song Liu <song@kernel.org> wrote:
> > > > > > >
> > > > > > > On Sun, Mar 26, 2023 at 2:22 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
> > > > > > > > to FDs, that's intended for BPF's security model[1]. Not only does it
> > > > > > > > prevent non-privilidged users from getting other users' bpf program, but
> > > > > > > > also it prevents the user from iterating his own bpf objects.
> > > > > > > >
> > > > > > > > In container environment, some users want to run bpf programs in their
> > > > > > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > > > > > some other specific CAPs, but they can't inspect their bpf programs in a
> > > > > > > > generic way. For example, the bpftool can't be used as it requires
> > > > > > > > CAP_SYS_ADMIN. That is very inconvenient.
> > > > > > >
> > > > > > > Agreed that it is important to enable tools like bpftool without
> > > > > > > CAP_SYS_ADMIN. However, I am not sure whether we need a new
> > > > > > > namespace for this. Can we reuse some existing namespace for this?
> > > > > > >
> > > > > >
> > > > > > It seems we can't.
> > > > >
> > > > > Yafang,
> > > > >
> > > > > It's a Nack.
> > > > >
> > > > > The only thing you've been trying to "solve" with bpf namespace is to
> > > > > allow 'bpftool prog show' iterate progs in the "namespace" without CAP_SYS_ADMIN.
> > > > > The concept of bpf namespace is not even close to be thought through.
> > > >
> > > > Right, it is more likely a PoC in its current state.
> > > >
> > > > > Others pointed out the gaps in the design. Like bpffs. There are plenty.
> > > > > Please do not send patches like this in the future.
> > > >
> > > > The reason I sent it with an early state is that I want to get some
> > > > early feedback from the community ahead of the LSF/MM/BPF workshop,
> > > > then I can improve it based on these feedbacks and present it more
> > > > specifically at the workshop. Then the discussion will be more
> > > > effective.
> > > >
> > > > > You need to start with describing the problem you want to solve,
> > > > > then propose _several_ solutions, describe their pros and cons,
> > > > > solicit feedback, present at the conferences (like LSFMMBPF or LPC),
> > > > > and when the community agrees that 1. problem is worth solving,
> > > > > 2. the solution makes sense, only then work on patches.
> > > > >
> > > >
> > > > I would like to give a short discussion on the BPF namespace if
> > > > everything goes fine.
> > >
> > > Not in this shape of BPF namespace as done in this patch set.
> > > We've talked about BPF namespace in the past. This is not it.
> > >
> > > > > "In container environment, some users want to run bpf programs in their containers."
> > > > > is something Song brought up at LSFMMBPF a year ago.
> > > > > At that meeting most of the folks agreed that there is a need to run bpf
> > > > > in containers and make sure that the effect of bpf prog is limited to a container.
> > > > > This new namespace that creates virtual IDs for progs and maps doesn't come
> > > > > close in solving this task.
> > > >
> > > > Currently in our production environment, all the containers running
> > > > bpf programs are privileged, that is risky. So actually the goal of
> > > > the BPF namespace is to make them (or part of them) non-privileged.
> > > > But some of the abilities of these bpf programs will be lost in this
> > > > procedure, like the debug-bility with bpftool, so we need to fix it.
> > > > Agree with you that this goal is far from making bpf programs safely
> > > > running in a container environment.
> > >
> > > I disagree that allowing admin to run bpftool without sudo is a task
> > > worth solving. The visibility of bpf progs in a container is a different task.
> > > Without doing any kernel changes we can add a flag to bpftool to let
> > > 'bpftool prog show' list progs that were loaded by processes in the same cgroup.
> > > bpftool already does prog->pid mapping with bpf iterators.
> > > It can filter by cgroup just as well.
> >
> > IIUC, at least we need bellow change in the kernel,
>
> No. The user should just 'sudo bpftool ...' instead.
>

It seems that I didn't describe the issue clearly.
The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
required to run bpftool,  so the bpftool running in the container
can't get the ID of bpf objects or convert IDs to FDs.
Is there something that I missed ?

> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -3705,9 +3705,6 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
> >         if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX)
> >                 return -EINVAL;
> >
> > -       if (!capable(CAP_SYS_ADMIN))
> > -               return -EPERM;
> > -
> >         next_id++;
> >         spin_lock_bh(lock);
> >         if (!idr_get_next(idr, &next_id))
> >
> > Because the container doesn't have CAP_SYS_ADMIN enabled, while they
> > only have CAP_BPF and other required CAPs.
> >
> > Another possible solution is that we run an agent in the host, and the
> > user in the container who wants to get the bpf objects info in his
> > container should send a request to this agent via unix domain socket.
> > That is what we are doing now in our production environment.  That
> > said, each container has to run a client to get the bpf object fd.
>
> None of such hacks are necessary. People that debug bpf setups with bpftool
> can always sudo.
>
> > There are some downsides,
> > -  It can't handle pinned bpf programs
> >    For pinned programs, the user can get them from the pinned files
> > directly, so he can use bpftool in his case, only with some
> > complaints.
> > -  If the user attached the bpf prog, and then removed the pinned
> > file, but didn't detach it.
> >    That happened. But this error case can't be handled.
> > - There may be other corner cases that it can't fit.
> >
> > There's a solution to improve it, but we also need to change the
> > kernel. That is, we can use the wasted space btf->name.
> >
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index b7e5a55..59d73a3 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -5542,6 +5542,8 @@ static struct btf *btf_parse(bpfptr_t btf_data,
> > u32 btf_data_size,
> >                 err = -ENOMEM;
> >                 goto errout;
> >         }
> > +       snprintf(btf->name, sizeof(btf->name), "%s-%d-%d", current->comm,
> > +                        current->pid, cgroup_id(task_cgroup(p, cpu_cgrp_id)));
>
> Unnecessary.
> comm, pid, cgroup can be printed by bpftool without changing the kernel.

Some questions,
- What if the process exits after attaching the bpf prog and the prog
is not auto-detachable?
  For example, the reuserport bpf prog is not auto-detachable. After
pins the reuserport bpf prog, a task can attach it through the pinned
bpf file, but if the task forgets to detach it and the pinned file is
removed, then it seems there's no way to figure out which task or
cgroup this prog belongs to...
- Could you pls. explain in detail how to get comm, pid, or cgroup
from a pinned bpffs file?

Alexei Starovoitov April 6, 2023, 3:05 a.m. UTC | #21

On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> It seems that I didn't describe the issue clearly.
> The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> required to run bpftool,  so the bpftool running in the container
> can't get the ID of bpf objects or convert IDs to FDs.
> Is there something that I missed ?

Nothing. This is by design. bpftool needs sudo. That's all.


>
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -3705,9 +3705,6 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
> > >         if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX)
> > >                 return -EINVAL;
> > >
> > > -       if (!capable(CAP_SYS_ADMIN))
> > > -               return -EPERM;
> > > -
> > >         next_id++;
> > >         spin_lock_bh(lock);
> > >         if (!idr_get_next(idr, &next_id))
> > >
> > > Because the container doesn't have CAP_SYS_ADMIN enabled, while they
> > > only have CAP_BPF and other required CAPs.
> > >
> > > Another possible solution is that we run an agent in the host, and the
> > > user in the container who wants to get the bpf objects info in his
> > > container should send a request to this agent via unix domain socket.
> > > That is what we are doing now in our production environment.  That
> > > said, each container has to run a client to get the bpf object fd.
> >
> > None of such hacks are necessary. People that debug bpf setups with bpftool
> > can always sudo.
> >
> > > There are some downsides,
> > > -  It can't handle pinned bpf programs
> > >    For pinned programs, the user can get them from the pinned files
> > > directly, so he can use bpftool in his case, only with some
> > > complaints.
> > > -  If the user attached the bpf prog, and then removed the pinned
> > > file, but didn't detach it.
> > >    That happened. But this error case can't be handled.
> > > - There may be other corner cases that it can't fit.
> > >
> > > There's a solution to improve it, but we also need to change the
> > > kernel. That is, we can use the wasted space btf->name.
> > >
> > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > index b7e5a55..59d73a3 100644
> > > --- a/kernel/bpf/btf.c
> > > +++ b/kernel/bpf/btf.c
> > > @@ -5542,6 +5542,8 @@ static struct btf *btf_parse(bpfptr_t btf_data,
> > > u32 btf_data_size,
> > >                 err = -ENOMEM;
> > >                 goto errout;
> > >         }
> > > +       snprintf(btf->name, sizeof(btf->name), "%s-%d-%d", current->comm,
> > > +                        current->pid, cgroup_id(task_cgroup(p, cpu_cgrp_id)));
> >
> > Unnecessary.
> > comm, pid, cgroup can be printed by bpftool without changing the kernel.
>
> Some questions,
> - What if the process exits after attaching the bpf prog and the prog
> is not auto-detachable?
>   For example, the reuserport bpf prog is not auto-detachable. After
> pins the reuserport bpf prog, a task can attach it through the pinned
> bpf file, but if the task forgets to detach it and the pinned file is
> removed, then it seems there's no way to figure out which task or
> cgroup this prog belongs to...

you're saying that there is a bpf prog in the kernel without
corresponding user space ? Meaning no user space process has an FD
that points to this prog or FD to a map that this prog is using?
In such a case this is truly kernel bpf prog. It doesn't belong to cgroup.

> - Could you pls. explain in detail how to get comm, pid, or cgroup
> from a pinned bpffs file?

pinned bpf prog and no user space holds FD to it?
It's not part of any cgroup. Nothing to print.

Yafang Shao April 6, 2023, 3:22 a.m. UTC | #22

On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > It seems that I didn't describe the issue clearly.
> > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > required to run bpftool,  so the bpftool running in the container
> > can't get the ID of bpf objects or convert IDs to FDs.
> > Is there something that I missed ?
>
> Nothing. This is by design. bpftool needs sudo. That's all.
>

Hmm, what I'm trying to do is make bpftool run without sudo.

>
> >
> > > > --- a/kernel/bpf/syscall.c
> > > > +++ b/kernel/bpf/syscall.c
> > > > @@ -3705,9 +3705,6 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
> > > >         if (CHECK_ATTR(BPF_OBJ_GET_NEXT_ID) || next_id >= INT_MAX)
> > > >                 return -EINVAL;
> > > >
> > > > -       if (!capable(CAP_SYS_ADMIN))
> > > > -               return -EPERM;
> > > > -
> > > >         next_id++;
> > > >         spin_lock_bh(lock);
> > > >         if (!idr_get_next(idr, &next_id))
> > > >
> > > > Because the container doesn't have CAP_SYS_ADMIN enabled, while they
> > > > only have CAP_BPF and other required CAPs.
> > > >
> > > > Another possible solution is that we run an agent in the host, and the
> > > > user in the container who wants to get the bpf objects info in his
> > > > container should send a request to this agent via unix domain socket.
> > > > That is what we are doing now in our production environment.  That
> > > > said, each container has to run a client to get the bpf object fd.
> > >
> > > None of such hacks are necessary. People that debug bpf setups with bpftool
> > > can always sudo.
> > >
> > > > There are some downsides,
> > > > -  It can't handle pinned bpf programs
> > > >    For pinned programs, the user can get them from the pinned files
> > > > directly, so he can use bpftool in his case, only with some
> > > > complaints.
> > > > -  If the user attached the bpf prog, and then removed the pinned
> > > > file, but didn't detach it.
> > > >    That happened. But this error case can't be handled.
> > > > - There may be other corner cases that it can't fit.
> > > >
> > > > There's a solution to improve it, but we also need to change the
> > > > kernel. That is, we can use the wasted space btf->name.
> > > >
> > > > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > > > index b7e5a55..59d73a3 100644
> > > > --- a/kernel/bpf/btf.c
> > > > +++ b/kernel/bpf/btf.c
> > > > @@ -5542,6 +5542,8 @@ static struct btf *btf_parse(bpfptr_t btf_data,
> > > > u32 btf_data_size,
> > > >                 err = -ENOMEM;
> > > >                 goto errout;
> > > >         }
> > > > +       snprintf(btf->name, sizeof(btf->name), "%s-%d-%d", current->comm,
> > > > +                        current->pid, cgroup_id(task_cgroup(p, cpu_cgrp_id)));
> > >
> > > Unnecessary.
> > > comm, pid, cgroup can be printed by bpftool without changing the kernel.
> >
> > Some questions,
> > - What if the process exits after attaching the bpf prog and the prog
> > is not auto-detachable?
> >   For example, the reuserport bpf prog is not auto-detachable. After
> > pins the reuserport bpf prog, a task can attach it through the pinned
> > bpf file, but if the task forgets to detach it and the pinned file is
> > removed, then it seems there's no way to figure out which task or
> > cgroup this prog belongs to...
>
> you're saying that there is a bpf prog in the kernel without
> corresponding user space ?

No, it is corresponding to user space. For example, it may be
corresponding to a socket fd, or a cgroup fd.

> Meaning no user space process has an FD
> that points to this prog or FD to a map that this prog is using?
> In such a case this is truly kernel bpf prog. It doesn't belong to cgroup.
>

Even if it is kernel bpf prog, it is created by a process. The user
needs to know which one created it.

> > - Could you pls. explain in detail how to get comm, pid, or cgroup
> > from a pinned bpffs file?
>
> pinned bpf prog and no user space holds FD to it?
> It's not part of any cgroup. Nothing to print.

As I explained above, even if it holds nothing, the user needs to know
the information from it. For example, if it is expected, which one
created it?

Alexei Starovoitov April 6, 2023, 4:24 a.m. UTC | #23

On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > It seems that I didn't describe the issue clearly.
> > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > required to run bpftool,  so the bpftool running in the container
> > > can't get the ID of bpf objects or convert IDs to FDs.
> > > Is there something that I missed ?
> >
> > Nothing. This is by design. bpftool needs sudo. That's all.
> >
>
> Hmm, what I'm trying to do is make bpftool run without sudo.

This is not a task that is worth solving.

> > > Some questions,
> > > - What if the process exits after attaching the bpf prog and the prog
> > > is not auto-detachable?
> > >   For example, the reuserport bpf prog is not auto-detachable. After
> > > pins the reuserport bpf prog, a task can attach it through the pinned
> > > bpf file, but if the task forgets to detach it and the pinned file is
> > > removed, then it seems there's no way to figure out which task or
> > > cgroup this prog belongs to...
> >
> > you're saying that there is a bpf prog in the kernel without
> > corresponding user space ?
>
> No, it is corresponding to user space. For example, it may be
> corresponding to a socket fd, or a cgroup fd.
>
> > Meaning no user space process has an FD
> > that points to this prog or FD to a map that this prog is using?
> > In such a case this is truly kernel bpf prog. It doesn't belong to cgroup.
> >
>
> Even if it is kernel bpf prog, it is created by a process. The user
> needs to know which one created it.

In some situations it's certainly interesting to know which process
loaded a particular program.
In many other situations it's irrelevant.
For example, the process that loaded a prog could have been moved to a
different cgroup.
If you want to track the loading you need to install bpf_lsm
that monitors prog_load hook and collect that info.
It's not the job of the kernel to do it.

> > > - Could you pls. explain in detail how to get comm, pid, or cgroup
> > > from a pinned bpffs file?
> >
> > pinned bpf prog and no user space holds FD to it?
> > It's not part of any cgroup. Nothing to print.
>
> As I explained above, even if it holds nothing, the user needs to know
> the information from it. For example, if it is expected, which one
> created it?

See the answer above. The kernel has enough hooks already to provide
this information to user space. No kernel changes necessary.

Yafang Shao April 6, 2023, 5:43 a.m. UTC | #24

On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > It seems that I didn't describe the issue clearly.
> > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > required to run bpftool,  so the bpftool running in the container
> > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > Is there something that I missed ?
> > >
> > > Nothing. This is by design. bpftool needs sudo. That's all.
> > >
> >
> > Hmm, what I'm trying to do is make bpftool run without sudo.
>
> This is not a task that is worth solving.
>

Then the container with CAP_BPF enabled can't even iterate its bpf progs ...

> > > > Some questions,
> > > > - What if the process exits after attaching the bpf prog and the prog
> > > > is not auto-detachable?
> > > >   For example, the reuserport bpf prog is not auto-detachable. After
> > > > pins the reuserport bpf prog, a task can attach it through the pinned
> > > > bpf file, but if the task forgets to detach it and the pinned file is
> > > > removed, then it seems there's no way to figure out which task or
> > > > cgroup this prog belongs to...
> > >
> > > you're saying that there is a bpf prog in the kernel without
> > > corresponding user space ?
> >
> > No, it is corresponding to user space. For example, it may be
> > corresponding to a socket fd, or a cgroup fd.
> >
> > > Meaning no user space process has an FD
> > > that points to this prog or FD to a map that this prog is using?
> > > In such a case this is truly kernel bpf prog. It doesn't belong to cgroup.
> > >
> >
> > Even if it is kernel bpf prog, it is created by a process. The user
> > needs to know which one created it.
>
> In some situations it's certainly interesting to know which process
> loaded a particular program.
> In many other situations it's irrelevant.
> For example, the process that loaded a prog could have been moved to a
> different cgroup.
> If you want to track the loading you need to install bpf_lsm
> that monitors prog_load hook and collect that info.
> It's not the job of the kernel to do it.
>

Agreed with you that we can add lots of hooks to track every detail of
the operations.
But it is not free. More hooks, more overhead.
If we can change the kernel to make it lightweight, why not...

> > > > - Could you pls. explain in detail how to get comm, pid, or cgroup
> > > > from a pinned bpffs file?
> > >
> > > pinned bpf prog and no user space holds FD to it?
> > > It's not part of any cgroup. Nothing to print.
> >
> > As I explained above, even if it holds nothing, the user needs to know
> > the information from it. For example, if it is expected, which one
> > created it?
>
> See the answer above. The kernel has enough hooks already to provide
> this information to user space. No kernel changes necessary.

Andrii Nakryiko April 6, 2023, 8:22 p.m. UTC | #25

On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > It seems that I didn't describe the issue clearly.
> > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > required to run bpftool,  so the bpftool running in the container
> > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > Is there something that I missed ?
> > > >
> > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > >
> > >
> > > Hmm, what I'm trying to do is make bpftool run without sudo.
> >
> > This is not a task that is worth solving.
> >
>
> Then the container with CAP_BPF enabled can't even iterate its bpf progs ...

I'll leave the BPF namespace discussion aside (I agree that it needs
way more thought).

I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
you to take over someone else's link and stuff like this. But just
iterating IDs seems like a pretty innocent functionality, so maybe we
should remove CAP_SYS_ADMIN for GET_NEXT_ID?

By itself GET_NEXT_ID is relatively useless without capabilities, but
we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
for a while now, and that seems useful in itself, as it would indeed
help tools like bpftool to get *some* information even without
privileges. Whether those GET_INFO_BY_ID operations should return same
full bpf_{prog,map,link,btf}_info or some trimmed down version of them
would be up to discussion, but I think getting some info without
creating an FD seems useful in itself.

Would it be worth discussing and solving this separately from
namespacing issues?

>
> > > > > Some questions,
> > > > > - What if the process exits after attaching the bpf prog and the prog
> > > > > is not auto-detachable?
> > > > >   For example, the reuserport bpf prog is not auto-detachable. After
> > > > > pins the reuserport bpf prog, a task can attach it through the pinned
> > > > > bpf file, but if the task forgets to detach it and the pinned file is
> > > > > removed, then it seems there's no way to figure out which task or
> > > > > cgroup this prog belongs to...
> > > >
> > > > you're saying that there is a bpf prog in the kernel without
> > > > corresponding user space ?
> > >
> > > No, it is corresponding to user space. For example, it may be
> > > corresponding to a socket fd, or a cgroup fd.
> > >
> > > > Meaning no user space process has an FD
> > > > that points to this prog or FD to a map that this prog is using?
> > > > In such a case this is truly kernel bpf prog. It doesn't belong to cgroup.
> > > >
> > >
> > > Even if it is kernel bpf prog, it is created by a process. The user
> > > needs to know which one created it.
> >
> > In some situations it's certainly interesting to know which process
> > loaded a particular program.
> > In many other situations it's irrelevant.
> > For example, the process that loaded a prog could have been moved to a
> > different cgroup.
> > If you want to track the loading you need to install bpf_lsm
> > that monitors prog_load hook and collect that info.
> > It's not the job of the kernel to do it.
> >
>
> Agreed with you that we can add lots of hooks to track every detail of
> the operations.
> But it is not free. More hooks, more overhead.
> If we can change the kernel to make it lightweight, why not...
>
> > > > > - Could you pls. explain in detail how to get comm, pid, or cgroup
> > > > > from a pinned bpffs file?
> > > >
> > > > pinned bpf prog and no user space holds FD to it?
> > > > It's not part of any cgroup. Nothing to print.
> > >
> > > As I explained above, even if it holds nothing, the user needs to know
> > > the information from it. For example, if it is expected, which one
> > > created it?
> >
> > See the answer above. The kernel has enough hooks already to provide
> > this information to user space. No kernel changes necessary.
>
>
>
> --
> Regards
> Yafang

Alexei Starovoitov April 7, 2023, 1:43 a.m. UTC | #26

On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote:
> On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > It seems that I didn't describe the issue clearly.
> > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > > required to run bpftool,  so the bpftool running in the container
> > > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > > Is there something that I missed ?
> > > > >
> > > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > > >
> > > >
> > > > Hmm, what I'm trying to do is make bpftool run without sudo.
> > >
> > > This is not a task that is worth solving.
> > >
> >
> > Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
> 
> I'll leave the BPF namespace discussion aside (I agree that it needs
> way more thought).
> 
> I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
> operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
> you to take over someone else's link and stuff like this. But just
> iterating IDs seems like a pretty innocent functionality, so maybe we
> should remove CAP_SYS_ADMIN for GET_NEXT_ID?
> 
> By itself GET_NEXT_ID is relatively useless without capabilities, but
> we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
> for a while now, and that seems useful in itself, as it would indeed
> help tools like bpftool to get *some* information even without
> privileges. Whether those GET_INFO_BY_ID operations should return same
> full bpf_{prog,map,link,btf}_info or some trimmed down version of them
> would be up to discussion, but I think getting some info without
> creating an FD seems useful in itself.
> 
> Would it be worth discussing and solving this separately from
> namespacing issues?

Iteration of IDs itself is fine. The set of IDs is not security sensitive,
but GET_NEXT_BY_ID has to be carefully restricted.
It returns xlated, jited, BTF, line info, etc
and with all the restrictions it would need something like
CAP_SYS_PTRACE and CAP_PERFMON to be useful.
And with that we're not far from CAP_SYS_ADMIN.
Why bother then?

Yafang Shao April 7, 2023, 4:33 a.m. UTC | #27

On Fri, Apr 7, 2023 at 9:44 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote:
> > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > >
> > > > > > > It seems that I didn't describe the issue clearly.
> > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > > > required to run bpftool,  so the bpftool running in the container
> > > > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > > > Is there something that I missed ?
> > > > > >
> > > > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > > > >
> > > > >
> > > > > Hmm, what I'm trying to do is make bpftool run without sudo.
> > > >
> > > > This is not a task that is worth solving.
> > > >
> > >
> > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
> >
> > I'll leave the BPF namespace discussion aside (I agree that it needs
> > way more thought).
> >
> > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
> > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
> > you to take over someone else's link and stuff like this. But just
> > iterating IDs seems like a pretty innocent functionality, so maybe we
> > should remove CAP_SYS_ADMIN for GET_NEXT_ID?
> >
> > By itself GET_NEXT_ID is relatively useless without capabilities, but
> > we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
> > for a while now, and that seems useful in itself, as it would indeed
> > help tools like bpftool to get *some* information even without
> > privileges. Whether those GET_INFO_BY_ID operations should return same
> > full bpf_{prog,map,link,btf}_info or some trimmed down version of them
> > would be up to discussion, but I think getting some info without
> > creating an FD seems useful in itself.
> >
> > Would it be worth discussing and solving this separately from
> > namespacing issues?
>
> Iteration of IDs itself is fine. The set of IDs is not security sensitive,
> but GET_NEXT_BY_ID has to be carefully restricted.
> It returns xlated, jited, BTF, line info, etc
> and with all the restrictions it would need something like
> CAP_SYS_PTRACE and CAP_PERFMON to be useful.
> And with that we're not far from CAP_SYS_ADMIN.
> Why bother then?

Not sure if I get your point clearly.  I think the reason we introduce
CAP_BPF is that we don't want the user to enable CAP_SYS_ADMIN.
Enabling specific CAPs instead of CAP_SYS_ADMIN should be our
alignment goal, right?
If so, then it is worth doing.  As Andrii suggested, a trimmed down
version is also helpful and should be acceptable. Agreed with you that
we have lots of things to do, but if we don't try to move it forward,
it will always be as-is.

Alexei Starovoitov April 7, 2023, 3:32 p.m. UTC | #28

On Thu, Apr 6, 2023 at 9:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Fri, Apr 7, 2023 at 9:44 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote:
> > > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > It seems that I didn't describe the issue clearly.
> > > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > > > > required to run bpftool,  so the bpftool running in the container
> > > > > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > > > > Is there something that I missed ?
> > > > > > >
> > > > > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > > > > >
> > > > > >
> > > > > > Hmm, what I'm trying to do is make bpftool run without sudo.
> > > > >
> > > > > This is not a task that is worth solving.
> > > > >
> > > >
> > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
> > >
> > > I'll leave the BPF namespace discussion aside (I agree that it needs
> > > way more thought).
> > >
> > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
> > > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
> > > you to take over someone else's link and stuff like this. But just
> > > iterating IDs seems like a pretty innocent functionality, so maybe we
> > > should remove CAP_SYS_ADMIN for GET_NEXT_ID?
> > >
> > > By itself GET_NEXT_ID is relatively useless without capabilities, but
> > > we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
> > > for a while now, and that seems useful in itself, as it would indeed
> > > help tools like bpftool to get *some* information even without
> > > privileges. Whether those GET_INFO_BY_ID operations should return same
> > > full bpf_{prog,map,link,btf}_info or some trimmed down version of them
> > > would be up to discussion, but I think getting some info without
> > > creating an FD seems useful in itself.
> > >
> > > Would it be worth discussing and solving this separately from
> > > namespacing issues?
> >
> > Iteration of IDs itself is fine. The set of IDs is not security sensitive,
> > but GET_NEXT_BY_ID has to be carefully restricted.
> > It returns xlated, jited, BTF, line info, etc
> > and with all the restrictions it would need something like
> > CAP_SYS_PTRACE and CAP_PERFMON to be useful.
> > And with that we're not far from CAP_SYS_ADMIN.
> > Why bother then?
>
> Not sure if I get your point clearly.  I think the reason we introduce
> CAP_BPF is that we don't want the user to enable CAP_SYS_ADMIN.
> Enabling specific CAPs instead of CAP_SYS_ADMIN should be our
> alignment goal, right?

We want users to switch to CAP_BPF (potentially with CAP_PERFMON)
from full CAP_SYS_ADMIN to reduce attack surface of production workloads.
bpftool is a tool for humans to do introspection and debugging.
It will stay root only.

> If so, then it is worth doing.  As Andrii suggested, a trimmed down
> version is also helpful and should be acceptable.

It's not helpful. It's actively misleading.

Andrii Nakryiko April 7, 2023, 3:59 p.m. UTC | #29

On Thu, Apr 6, 2023 at 6:44 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote:
> > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > >
> > > > > > > It seems that I didn't describe the issue clearly.
> > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > > > required to run bpftool,  so the bpftool running in the container
> > > > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > > > Is there something that I missed ?
> > > > > >
> > > > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > > > >
> > > > >
> > > > > Hmm, what I'm trying to do is make bpftool run without sudo.
> > > >
> > > > This is not a task that is worth solving.
> > > >
> > >
> > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
> >
> > I'll leave the BPF namespace discussion aside (I agree that it needs
> > way more thought).
> >
> > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
> > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
> > you to take over someone else's link and stuff like this. But just
> > iterating IDs seems like a pretty innocent functionality, so maybe we
> > should remove CAP_SYS_ADMIN for GET_NEXT_ID?
> >
> > By itself GET_NEXT_ID is relatively useless without capabilities, but
> > we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
> > for a while now, and that seems useful in itself, as it would indeed
> > help tools like bpftool to get *some* information even without
> > privileges. Whether those GET_INFO_BY_ID operations should return same
> > full bpf_{prog,map,link,btf}_info or some trimmed down version of them
> > would be up to discussion, but I think getting some info without
> > creating an FD seems useful in itself.
> >
> > Would it be worth discussing and solving this separately from
> > namespacing issues?
>
> Iteration of IDs itself is fine. The set of IDs is not security sensitive,
> but GET_NEXT_BY_ID has to be carefully restricted.
> It returns xlated, jited, BTF, line info, etc
> and with all the restrictions it would need something like
> CAP_SYS_PTRACE and CAP_PERFMON to be useful.
> And with that we're not far from CAP_SYS_ADMIN.
> Why bother then?

You probably meant that GET_INFO_BY_ID should be carefully restricted?
So yeah, that's what I said that this would have to be discussed
further. I agree that returning func/line info, program dump, etc is
probably a privileged part. But there is plenty of useful info besides
that (e.g., prog name, insns cnt, run stats, etc) that would be useful
for unpriv applications to monitor their own apps that they opened
from BPF FS, or just some observability daemons.

There is a lot of useful information in bpf_map_info and bpf_link_info
that's way less privileged. I think bpf_link_info is good as is. Same
for bpf_map_info.

Either way, I'm not insisting, just something that seems pretty simple
to add and useful in some scenarios. We can reuse existing code and
types for GET_INFO_BY_FD and just zero-out (or prevent filling out)
those privileged fields you mentioned. Anyway, something to put on the
backburner, perhaps.

Alexei Starovoitov April 7, 2023, 4:05 p.m. UTC | #30

On Fri, Apr 7, 2023 at 8:59 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Apr 6, 2023 at 6:44 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote:
> > > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > It seems that I didn't describe the issue clearly.
> > > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > > > > required to run bpftool,  so the bpftool running in the container
> > > > > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > > > > Is there something that I missed ?
> > > > > > >
> > > > > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > > > > >
> > > > > >
> > > > > > Hmm, what I'm trying to do is make bpftool run without sudo.
> > > > >
> > > > > This is not a task that is worth solving.
> > > > >
> > > >
> > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
> > >
> > > I'll leave the BPF namespace discussion aside (I agree that it needs
> > > way more thought).
> > >
> > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
> > > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
> > > you to take over someone else's link and stuff like this. But just
> > > iterating IDs seems like a pretty innocent functionality, so maybe we
> > > should remove CAP_SYS_ADMIN for GET_NEXT_ID?
> > >
> > > By itself GET_NEXT_ID is relatively useless without capabilities, but
> > > we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
> > > for a while now, and that seems useful in itself, as it would indeed
> > > help tools like bpftool to get *some* information even without
> > > privileges. Whether those GET_INFO_BY_ID operations should return same
> > > full bpf_{prog,map,link,btf}_info or some trimmed down version of them
> > > would be up to discussion, but I think getting some info without
> > > creating an FD seems useful in itself.
> > >
> > > Would it be worth discussing and solving this separately from
> > > namespacing issues?
> >
> > Iteration of IDs itself is fine. The set of IDs is not security sensitive,
> > but GET_NEXT_BY_ID has to be carefully restricted.
> > It returns xlated, jited, BTF, line info, etc
> > and with all the restrictions it would need something like
> > CAP_SYS_PTRACE and CAP_PERFMON to be useful.
> > And with that we're not far from CAP_SYS_ADMIN.
> > Why bother then?
>
> You probably meant that GET_INFO_BY_ID should be carefully restricted?

yes.

> So yeah, that's what I said that this would have to be discussed
> further. I agree that returning func/line info, program dump, etc is
> probably a privileged part. But there is plenty of useful info besides
> that (e.g., prog name, insns cnt, run stats, etc) that would be useful
> for unpriv applications to monitor their own apps that they opened
> from BPF FS, or just some observability daemons.
>
> There is a lot of useful information in bpf_map_info and bpf_link_info
> that's way less privileged. I think bpf_link_info is good as is. Same
> for bpf_map_info.
>
> Either way, I'm not insisting, just something that seems pretty simple
> to add and useful in some scenarios. We can reuse existing code and
> types for GET_INFO_BY_FD and just zero-out (or prevent filling out)
> those privileged fields you mentioned. Anyway, something to put on the
> backburner, perhaps.

Sorry, but I only see negatives. It's an extra code in the kernel
that has to be carefully reviewed when initially submitted and
then every patch that touches get_info_by_id would have to go
through a microscope every time to avoid introducing a security issue.
And for what? So that CAP_BPF application can read prog name and run stats?

Yafang Shao April 7, 2023, 4:21 p.m. UTC | #31

On Sat, Apr 8, 2023 at 12:05 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Apr 7, 2023 at 8:59 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Thu, Apr 6, 2023 at 6:44 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote:
> > > > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > It seems that I didn't describe the issue clearly.
> > > > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > > > > > required to run bpftool,  so the bpftool running in the container
> > > > > > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > > > > > Is there something that I missed ?
> > > > > > > >
> > > > > > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > > > > > >
> > > > > > >
> > > > > > > Hmm, what I'm trying to do is make bpftool run without sudo.
> > > > > >
> > > > > > This is not a task that is worth solving.
> > > > > >
> > > > >
> > > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
> > > >
> > > > I'll leave the BPF namespace discussion aside (I agree that it needs
> > > > way more thought).
> > > >
> > > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
> > > > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
> > > > you to take over someone else's link and stuff like this. But just
> > > > iterating IDs seems like a pretty innocent functionality, so maybe we
> > > > should remove CAP_SYS_ADMIN for GET_NEXT_ID?
> > > >
> > > > By itself GET_NEXT_ID is relatively useless without capabilities, but
> > > > we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
> > > > for a while now, and that seems useful in itself, as it would indeed
> > > > help tools like bpftool to get *some* information even without
> > > > privileges. Whether those GET_INFO_BY_ID operations should return same
> > > > full bpf_{prog,map,link,btf}_info or some trimmed down version of them
> > > > would be up to discussion, but I think getting some info without
> > > > creating an FD seems useful in itself.
> > > >
> > > > Would it be worth discussing and solving this separately from
> > > > namespacing issues?
> > >
> > > Iteration of IDs itself is fine. The set of IDs is not security sensitive,
> > > but GET_NEXT_BY_ID has to be carefully restricted.
> > > It returns xlated, jited, BTF, line info, etc
> > > and with all the restrictions it would need something like
> > > CAP_SYS_PTRACE and CAP_PERFMON to be useful.
> > > And with that we're not far from CAP_SYS_ADMIN.
> > > Why bother then?
> >
> > You probably meant that GET_INFO_BY_ID should be carefully restricted?
>
> yes.
>
> > So yeah, that's what I said that this would have to be discussed
> > further. I agree that returning func/line info, program dump, etc is
> > probably a privileged part. But there is plenty of useful info besides
> > that (e.g., prog name, insns cnt, run stats, etc) that would be useful
> > for unpriv applications to monitor their own apps that they opened
> > from BPF FS, or just some observability daemons.
> >
> > There is a lot of useful information in bpf_map_info and bpf_link_info
> > that's way less privileged. I think bpf_link_info is good as is. Same
> > for bpf_map_info.
> >
> > Either way, I'm not insisting, just something that seems pretty simple
> > to add and useful in some scenarios. We can reuse existing code and
> > types for GET_INFO_BY_FD and just zero-out (or prevent filling out)
> > those privileged fields you mentioned. Anyway, something to put on the
> > backburner, perhaps.
>
> Sorry, but I only see negatives. It's an extra code in the kernel
> that has to be carefully reviewed when initially submitted and
> then every patch that touches get_info_by_id would have to go
> through a microscope every time to avoid introducing a security issue.
> And for what? So that CAP_BPF application can read prog name and run stats?

Per my experience, observability is a very important part for a
project. If the user can't observe the object directly created by it,
he will worry about or even mistrust it.
However I don't insist on it either if you think we shouldn't do it.

Alexei Starovoitov April 7, 2023, 4:31 p.m. UTC | #32

On Fri, Apr 7, 2023 at 9:22 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sat, Apr 8, 2023 at 12:05 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Apr 7, 2023 at 8:59 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Thu, Apr 6, 2023 at 6:44 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote:
> > > > > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> > > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > > > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > It seems that I didn't describe the issue clearly.
> > > > > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > > > > > > required to run bpftool,  so the bpftool running in the container
> > > > > > > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > > > > > > Is there something that I missed ?
> > > > > > > > >
> > > > > > > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Hmm, what I'm trying to do is make bpftool run without sudo.
> > > > > > >
> > > > > > > This is not a task that is worth solving.
> > > > > > >
> > > > > >
> > > > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
> > > > >
> > > > > I'll leave the BPF namespace discussion aside (I agree that it needs
> > > > > way more thought).
> > > > >
> > > > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
> > > > > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
> > > > > you to take over someone else's link and stuff like this. But just
> > > > > iterating IDs seems like a pretty innocent functionality, so maybe we
> > > > > should remove CAP_SYS_ADMIN for GET_NEXT_ID?
> > > > >
> > > > > By itself GET_NEXT_ID is relatively useless without capabilities, but
> > > > > we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
> > > > > for a while now, and that seems useful in itself, as it would indeed
> > > > > help tools like bpftool to get *some* information even without
> > > > > privileges. Whether those GET_INFO_BY_ID operations should return same
> > > > > full bpf_{prog,map,link,btf}_info or some trimmed down version of them
> > > > > would be up to discussion, but I think getting some info without
> > > > > creating an FD seems useful in itself.
> > > > >
> > > > > Would it be worth discussing and solving this separately from
> > > > > namespacing issues?
> > > >
> > > > Iteration of IDs itself is fine. The set of IDs is not security sensitive,
> > > > but GET_NEXT_BY_ID has to be carefully restricted.
> > > > It returns xlated, jited, BTF, line info, etc
> > > > and with all the restrictions it would need something like
> > > > CAP_SYS_PTRACE and CAP_PERFMON to be useful.
> > > > And with that we're not far from CAP_SYS_ADMIN.
> > > > Why bother then?
> > >
> > > You probably meant that GET_INFO_BY_ID should be carefully restricted?
> >
> > yes.
> >
> > > So yeah, that's what I said that this would have to be discussed
> > > further. I agree that returning func/line info, program dump, etc is
> > > probably a privileged part. But there is plenty of useful info besides
> > > that (e.g., prog name, insns cnt, run stats, etc) that would be useful
> > > for unpriv applications to monitor their own apps that they opened
> > > from BPF FS, or just some observability daemons.
> > >
> > > There is a lot of useful information in bpf_map_info and bpf_link_info
> > > that's way less privileged. I think bpf_link_info is good as is. Same
> > > for bpf_map_info.
> > >
> > > Either way, I'm not insisting, just something that seems pretty simple
> > > to add and useful in some scenarios. We can reuse existing code and
> > > types for GET_INFO_BY_FD and just zero-out (or prevent filling out)
> > > those privileged fields you mentioned. Anyway, something to put on the
> > > backburner, perhaps.
> >
> > Sorry, but I only see negatives. It's an extra code in the kernel
> > that has to be carefully reviewed when initially submitted and
> > then every patch that touches get_info_by_id would have to go
> > through a microscope every time to avoid introducing a security issue.
> > And for what? So that CAP_BPF application can read prog name and run stats?
>
> Per my experience, observability is a very important part for a
> project. If the user can't observe the object directly created by it,
> he will worry about or even mistrust it.

The user can observe the objects just fine. That's what get_info_by_fd is for.
But the kernel will not report JITed instructions to unpriv user who
just loaded a prog and a sole owner of it.
By your definition such a user should not trust the kernel. So be it.

Yafang Shao April 7, 2023, 4:35 p.m. UTC | #33

On Sat, Apr 8, 2023 at 12:32 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Apr 7, 2023 at 9:22 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Sat, Apr 8, 2023 at 12:05 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Apr 7, 2023 at 8:59 AM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Thu, Apr 6, 2023 at 6:44 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Thu, Apr 06, 2023 at 01:22:26PM -0700, Andrii Nakryiko wrote:
> > > > > > On Wed, Apr 5, 2023 at 10:44 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, Apr 6, 2023 at 12:24 PM Alexei Starovoitov
> > > > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Apr 5, 2023 at 8:22 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Apr 6, 2023 at 11:06 AM Alexei Starovoitov
> > > > > > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Apr 5, 2023 at 7:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > It seems that I didn't describe the issue clearly.
> > > > > > > > > > > The container doesn't have CAP_SYS_ADMIN, but the CAP_SYS_ADMIN is
> > > > > > > > > > > required to run bpftool,  so the bpftool running in the container
> > > > > > > > > > > can't get the ID of bpf objects or convert IDs to FDs.
> > > > > > > > > > > Is there something that I missed ?
> > > > > > > > > >
> > > > > > > > > > Nothing. This is by design. bpftool needs sudo. That's all.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Hmm, what I'm trying to do is make bpftool run without sudo.
> > > > > > > >
> > > > > > > > This is not a task that is worth solving.
> > > > > > > >
> > > > > > >
> > > > > > > Then the container with CAP_BPF enabled can't even iterate its bpf progs ...
> > > > > >
> > > > > > I'll leave the BPF namespace discussion aside (I agree that it needs
> > > > > > way more thought).
> > > > > >
> > > > > > I am a bit surprised that we require CAP_SYS_ADMIN for GET_NEXT_ID
> > > > > > operations. GET_FD_BY_ID is definitely CAP_SYS_ADMIN, as they allow
> > > > > > you to take over someone else's link and stuff like this. But just
> > > > > > iterating IDs seems like a pretty innocent functionality, so maybe we
> > > > > > should remove CAP_SYS_ADMIN for GET_NEXT_ID?
> > > > > >
> > > > > > By itself GET_NEXT_ID is relatively useless without capabilities, but
> > > > > > we've been floating the idea of providing GET_INFO_BY_ID (not by FD)
> > > > > > for a while now, and that seems useful in itself, as it would indeed
> > > > > > help tools like bpftool to get *some* information even without
> > > > > > privileges. Whether those GET_INFO_BY_ID operations should return same
> > > > > > full bpf_{prog,map,link,btf}_info or some trimmed down version of them
> > > > > > would be up to discussion, but I think getting some info without
> > > > > > creating an FD seems useful in itself.
> > > > > >
> > > > > > Would it be worth discussing and solving this separately from
> > > > > > namespacing issues?
> > > > >
> > > > > Iteration of IDs itself is fine. The set of IDs is not security sensitive,
> > > > > but GET_NEXT_BY_ID has to be carefully restricted.
> > > > > It returns xlated, jited, BTF, line info, etc
> > > > > and with all the restrictions it would need something like
> > > > > CAP_SYS_PTRACE and CAP_PERFMON to be useful.
> > > > > And with that we're not far from CAP_SYS_ADMIN.
> > > > > Why bother then?
> > > >
> > > > You probably meant that GET_INFO_BY_ID should be carefully restricted?
> > >
> > > yes.
> > >
> > > > So yeah, that's what I said that this would have to be discussed
> > > > further. I agree that returning func/line info, program dump, etc is
> > > > probably a privileged part. But there is plenty of useful info besides
> > > > that (e.g., prog name, insns cnt, run stats, etc) that would be useful
> > > > for unpriv applications to monitor their own apps that they opened
> > > > from BPF FS, or just some observability daemons.
> > > >
> > > > There is a lot of useful information in bpf_map_info and bpf_link_info
> > > > that's way less privileged. I think bpf_link_info is good as is. Same
> > > > for bpf_map_info.
> > > >
> > > > Either way, I'm not insisting, just something that seems pretty simple
> > > > to add and useful in some scenarios. We can reuse existing code and
> > > > types for GET_INFO_BY_FD and just zero-out (or prevent filling out)
> > > > those privileged fields you mentioned. Anyway, something to put on the
> > > > backburner, perhaps.
> > >
> > > Sorry, but I only see negatives. It's an extra code in the kernel
> > > that has to be carefully reviewed when initially submitted and
> > > then every patch that touches get_info_by_id would have to go
> > > through a microscope every time to avoid introducing a security issue.
> > > And for what? So that CAP_BPF application can read prog name and run stats?
> >
> > Per my experience, observability is a very important part for a
> > project. If the user can't observe the object directly created by it,
> > he will worry about or even mistrust it.
>
> The user can observe the objects just fine. That's what get_info_by_fd is for.
> But the kernel will not report JITed instructions to unpriv user who
> just loaded a prog and a sole owner of it.

There's no UAPI to create the JITed instructions directly per my
understanding. The JITed instructions are created by the kernel.
While they're really UAPI to create a map, prog, and link.

> By your definition such a user should not trust the kernel. So be it.

[RFC,bpf-next,00/13] bpf: Introduce BPF namespace

Message

Comments