[RESEND,v3,bpf-next,00/14] BPF token

Message ID 20230629051832.897119-1-andrii@kernel.org (mailing list archive)

Message

Andrii Nakryiko June 29, 2023, 5:18 a.m. UTC
This patch set introduces a new BPF object, the BPF token, which allows
delegating a subset of BPF functionality from a privileged system-wide daemon
(e.g., systemd or any other container manager) to a *trusted* unprivileged
application. Trust is the key here. This functionality is not about allowing
unconditional unprivileged BPF usage. Establishing trust, though, is
completely up to the discretion of the respective privileged application that
would create a BPF token, as different production setups can and do achieve it
through a combination of different means (signing, LSM, code reviews, etc.),
and it's undesirable and infeasible for the kernel to enforce any particular
way of validating the trustworthiness of a particular process.

The main motivation for BPF token is a desire to enable containerized
BPF applications to be used together with user namespaces. This is currently
impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
helpers like bpf_probe_read_kernel() and bpf_probe_read_user(), can safely read
arbitrary memory, and it's impossible to ensure that they only read memory of
processes belonging to any given namespace. This means that it's impossible to
have a namespace-aware CAP_BPF capability, and as such another mechanism to
allow safe usage of BPF functionality is necessary. BPF token, and delegation
of it to trusted unprivileged applications, is such a mechanism. The kernel
makes no assumption about what "trusted" constitutes in any particular case,
and it's up to specific privileged applications and their surrounding
infrastructure to decide that. What the kernel provides is a set of APIs to
create and tune a BPF token, and pass it around to privileged BPF commands that
are creating new BPF objects like BPF programs, BPF maps, etc.

A previous attempt at addressing this very same problem ([0]) tried to
utilize an authoritative LSM approach, but was conclusively rejected by
upstream LSM maintainers. The BPF token concept does not change anything about
the LSM approach, but can be combined with LSM hooks for very fine-grained
security policy. Some ideas about making BPF token more convenient to use with
LSM (in particular custom BPF LSM programs) were briefly described in the
recent LSF/MM/BPF 2023 presentation ([1]). E.g., an ability to specify
user-provided data (context), which in combination with BPF LSM would allow
implementing very dynamic and fine-grained custom security policies on top of
BPF token. In the interest of minimizing API surface area discussions this is
going to be added in follow-up patches, as it's not essential to the
fundamental concept of delegatable BPF token.

It should be noted that BPF token is conceptually quite similar to the idea of
the /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
difference is the idea of using a virtual anon_inode file to hold a BPF token
and allowing multiple independent instances of them, each with its own set of
restrictions. BPF pinning solves the problem of exposing such a BPF token
through the file system (BPF FS, in this case) for cases where transferring
FDs over Unix domain sockets is not convenient. And also, crucially, the BPF
token approach is not using any special stateful task-scoped flags. Instead,
the bpf() syscall accepts a token_fd parameter explicitly for each relevant
BPF command. This addresses the main concerns brought up during the /dev/bpf
discussion, and fits better with the overall BPF subsystem design.
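
To make the contrast with task-scoped flags concrete, here is a minimal
user-space model (not kernel code and not the UAPI from this series) of a
permission check where the token travels with each command. All names below
(toy_cmd, toy_token, toy_cmd_allowed) are hypothetical and exist only for
illustration; the real definitions are in the patches themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command identifiers; these are not real enum bpf_cmd values. */
enum toy_cmd { TOY_MAP_CREATE = 0, TOY_BTF_LOAD = 1, TOY_PROG_LOAD = 2 };

/* Hypothetical token: a bitmask of commands the privileged creator delegated. */
struct toy_token {
	uint64_t allowed_cmds;
};

/*
 * Decide whether a single command may proceed. No per-task state is
 * consulted: the caller either holds the capability, or passes a token
 * explicitly with this particular call.
 */
static bool toy_cmd_allowed(bool has_cap_bpf, const struct toy_token *tok,
			    enum toy_cmd cmd)
{
	if (has_cap_bpf)
		return true;
	if (tok == NULL)
		return false;
	return (tok->allowed_cmds & (UINT64_C(1) << cmd)) != 0;
}
```

In this model, two tokens with different masks can coexist in the same
process, and a call made without a token is unaffected by any token used
earlier.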

This patch set adds a basic minimum of functionality to make BPF token useful
and to discuss API and functionality. Currently only low-level libbpf APIs
support passing BPF token around, which allows testing kernel functionality,
but for the most part this is not sufficient for real-world applications,
which typically use high-level libbpf APIs based on `struct bpf_object` type.
This was done with the intent to limit the size of the patch set and
concentrate on mostly kernel-side changes. All the necessary plumbing for
libbpf will be sent as a separate follow-up patch set once kernel support
makes it upstream.

Another part that should happen once kernel-side BPF token is established is
a set of conventions between applications (e.g., systemd), tools (e.g.,
bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS
at well-defined locations, to allow applications to take advantage of this in
an automatic fashion without explicit code changes on the BPF application's
side. But I'd like to postpone this discussion until after the BPF token
concept lands.

One important distinction from v2 that should be noted is a change in the
semantics of the newly added BPF_TOKEN_CREATE command. Previously,
BPF_TOKEN_CREATE would create a BPF token kernel object and return its FD to
user-space, allowing it to be (optionally) pinned in BPF FS using the
BPF_OBJ_PIN command. This v3 version changes this slightly: BPF_TOKEN_CREATE
combines BPF token object creation *and* pinning in BPF FS. This change
ensures that a BPF token is always associated with a specific instance of BPF
FS and cannot "escape" it by an application re-pinning it somewhere else using
another BPF_OBJ_PIN call. Now, a BPF token can only be pinned once, during its
creation, better containing it inside the intended container (under the
assumption that BPF FS is set up in such a way as to not be shared with other
containers on the system).
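
The pin-once rule described above can be modeled in a few lines of user-space
C. This is only a sketch of the semantics, not the kernel implementation: the
struct, the function names, the example path, and the choice of error code
are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PATH_MAX 64

/* Hypothetical token record: remembers the one place it was pinned. */
struct toy_token {
	bool pinned;
	char path[TOY_PATH_MAX];
};

/* v3-style create: object creation and pinning happen in a single step. */
static int toy_token_create(struct toy_token *tok, const char *path)
{
	if (path == NULL || strlen(path) >= TOY_PATH_MAX)
		return -EINVAL;
	strcpy(tok->path, path);	/* length was checked above */
	tok->pinned = true;
	return 0;
}

/*
 * A later pin request for a token is always refused, so the token cannot
 * be moved out of the BPF FS instance it was created in. The error code
 * here is arbitrary for the sketch.
 */
static int toy_obj_pin_token(const struct toy_token *tok, const char *new_path)
{
	(void)tok;
	(void)new_path;
	return -EOPNOTSUPP;
}
```

For example, after toy_token_create(&tok, "/sys/fs/bpf/token") succeeds, a
call to toy_obj_pin_token(&tok, "/elsewhere/token") fails and tok.path is
unchanged.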

  [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
  [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
  [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/

v3->v3-resend:
  - I started integrating token_fd into bpf_object_open_opts and higher-level
    libbpf bpf_object APIs, but it started going a bit deeper into bpf_object
    implementation details and how libbpf performs feature detection and
    caching, so I decided to keep it separate from this patch set and not
    distract from the mostly kernel-side changes;
v2->v3:
  - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow
    BPF_OBJ_PIN for BPF token;
v1->v2:
  - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
  - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).

Andrii Nakryiko (14):
  bpf: introduce BPF token object
  libbpf: add bpf_token_create() API
  selftests/bpf: add BPF_TOKEN_CREATE test
  bpf: add BPF token support to BPF_MAP_CREATE command
  libbpf: add BPF token support to bpf_map_create() API
  selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command
  bpf: add BPF token support to BPF_BTF_LOAD command
  libbpf: add BPF token support to bpf_btf_load() API
  selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest
  bpf: add BPF token support to BPF_PROG_LOAD command
  bpf: take into account BPF token when fetching helper protos
  bpf: consistenly use BPF token throughout BPF verifier logic
  libbpf: add BPF token support to bpf_prog_load() API
  selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests

 drivers/media/rc/bpf-lirc.c                   |   2 +-
 include/linux/bpf.h                           |  79 ++++-
 include/linux/filter.h                        |   2 +-
 include/uapi/linux/bpf.h                      |  53 ++++
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/arraymap.c                         |   2 +-
 kernel/bpf/cgroup.c                           |   6 +-
 kernel/bpf/core.c                             |   3 +-
 kernel/bpf/helpers.c                          |   6 +-
 kernel/bpf/inode.c                            |  46 ++-
 kernel/bpf/syscall.c                          | 183 +++++++++---
 kernel/bpf/token.c                            | 201 +++++++++++++
 kernel/bpf/verifier.c                         |  13 +-
 kernel/trace/bpf_trace.c                      |   2 +-
 net/core/filter.c                             |  36 +--
 net/ipv4/bpf_tcp_ca.c                         |   2 +-
 net/netfilter/nf_bpf_link.c                   |   2 +-
 tools/include/uapi/linux/bpf.h                |  53 ++++
 tools/lib/bpf/bpf.c                           |  35 ++-
 tools/lib/bpf/bpf.h                           |  45 ++-
 tools/lib/bpf/libbpf.map                      |   1 +
 .../selftests/bpf/prog_tests/libbpf_probes.c  |   4 +
 .../selftests/bpf/prog_tests/libbpf_str.c     |   6 +
 .../testing/selftests/bpf/prog_tests/token.c  | 277 ++++++++++++++++++
 24 files changed, 957 insertions(+), 104 deletions(-)
 create mode 100644 kernel/bpf/token.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c

Comments

Toke Høiland-Jørgensen June 29, 2023, 11:15 p.m. UTC | #1
Andrii Nakryiko <andrii@kernel.org> writes:

> This patch set introduces new BPF object, BPF token, which allows to delegate
> a subset of BPF functionality from privileged system-wide daemon (e.g.,
> systemd or any other container manager) to a *trusted* unprivileged
> application. Trust is the key here. This functionality is not about allowing
> unconditional unprivileged BPF usage. Establishing trust, though, is
> completely up to the discretion of respective privileged application that
> would create a BPF token, as different production setups can and do achieve it
> through a combination of different means (signing, LSM, code reviews, etc),
> and it's undesirable and infeasible for kernel to enforce any particular way
> of validating trustworthiness of particular process.
>
> The main motivation for BPF token is a desire to enable containerized
> BPF applications to be used together with user namespaces. This is currently
> impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> arbitrary memory, and it's impossible to ensure that they only read memory of
> processes belonging to any given namespace. This means that it's impossible to
> have namespace-aware CAP_BPF capability, and as such another mechanism to
> allow safe usage of BPF functionality is necessary. BPF token and delegation
> of it to a trusted unprivileged applications is such mechanism. Kernel makes
> no assumption about what "trusted" constitutes in any particular case, and
> it's up to specific privileged applications and their surrounding
> infrastructure to decide that. What kernel provides is a set of APIs to create
> and tune BPF token, and pass it around to privileged BPF commands that are
> creating new BPF objects like BPF programs, BPF maps, etc.

So a colleague pointed out today that the Seccomp Notify functionality
would be a way to achieve your stated goal of allowing unprivileged
containers to (selectively) perform bpf() syscall operations. Christian
Brauner has a pretty nice writeup of the functionality here:
https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development

In fact he even mentions allowing unprivileged access to bpf() as a
possible use case (in the second-to-last paragraph).

AFAICT this would enable your use case without adding any new kernel
functionality or changing the BPF-using applications, while allowing the
privileged userspace daemon to make case-by-case decisions on each
operation instead of granting blanket capabilities (which is my main
objection to the token proposal, as we discussed on the last iteration
of the series).

So I'm curious whether you considered this as an alternative to
BPF_TOKEN? And if so, what your reason was for rejecting it?

-Toke

Andrii Nakryiko June 30, 2023, 6:25 p.m. UTC | #2
On Thu, Jun 29, 2023 at 4:15 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii@kernel.org> writes:
>
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token, as different production setups can and do achieve it
> > through a combination of different means (signing, LSM, code reviews, etc),
> > and it's undesirable and infeasible for kernel to enforce any particular way
> > of validating trustworthiness of particular process.
> >
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
>
> So a colleague pointed out today that the Seccomp Notify functionality
> would be a way to achieve your stated goal of allowing unprivileged
> containers to (selectively) perform bpf() syscall operations. Christian
> Brauner has a pretty nice writeup of the functionality here:
> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
>
> In fact he even mentions allowing unprivileged access to bpf() as a
> possible use case (in the second-to-last paragraph).
>
> AFAICT this would enable your use case without adding any new kernel
> functionality or changing the BPF-using applications, while allowing the
> privileged userspace daemon to make case-by-case decisions on each
> operation instead of granting blanket capabilities (which is my main
> objection to the token proposal, as we discussed on the last iteration
> of the series).

It's not "blanket" capabilities. You control the types of maps and
programs that could be created. And again, it's CAP_SYS_ADMIN guarded.
Please, don't give CAP_SYS_ADMIN/root permissions to applications you
can't be sure won't do something stupid, and then blame the kernel API
for it.

After all, the root process can setuid() any file and make it run with
elevated permissions, right? Doesn't get more "blanket" than that.

>
> So I'm curious whether you considered this as an alternative to
> BPF_TOKEN? And if so, what your reason was for rejecting it?
>

Yes, I'm aware, Christian has a short follow-up blog post specifically
about using this for proxying BPF from a privileged process ([0]).

So, in short, I think it's not a good generic solution. It's very
fragile and high-maintenance. It's still proxying BPF UAPI (except
the application does preserve the illusion of using the BPF syscall, yes,
that part is good) with all the implications: needing to replicate all of
UAPI (fetching all those FDs from another process, following all the
pointers from another process' memory, etc), and also writing back all
the correct things (into another process' memory): log content,
log_true_size (out param), any other output parameters. What do we do
when an application uses a newer version of bpf_attr than is supported
by the proxy? And honestly, I'm like 99% sure there are lots of less
obvious issues one runs into when starting to implement something like
this.
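
One concrete instance of the bpf_attr versioning problem: when user space
passes an attribute struct larger than the receiver knows about, the kernel
accepts it only if the unknown trailing bytes are all zero, and fails with
E2BIG otherwise. A proxy would have to replicate that check for the layout it
was built against. A minimal user-space sketch, where the function name is
hypothetical and the buffer is assumed to have already been copied out of the
target process:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * known_size:  size of the attr layout this proxy was built against.
 * actual_size: size the application passed to bpf().
 * Returns 0 if every byte beyond known_size is zero (or there are none),
 * -E2BIG if the application set fields the proxy does not know about.
 */
static int toy_check_attr_tail_zero(const unsigned char *attr,
				    size_t known_size, size_t actual_size)
{
	size_t i;

	if (actual_size <= known_size)
		return 0;
	for (i = known_size; i < actual_size; i++) {
		if (attr[i] != 0)
			return -E2BIG;
	}
	return 0;
}
```

Passing this check only means the proxy can interpret the request; it still
has to chase every pointer and FD the known fields refer to.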

This sounds like a hack and nightmare to implement and support.
Perhaps that indirectly is supported by the fact that even Christian
half-jokingly calls this a crazy approach. That code basically is
unchanged for the last three years, with only one fix from Christian
one year after initial introduction ([1]) to fix a quirky issue
related to the limitation of pidfd working only for thread group
leaders. It also still supports only BPF_PROG_TYPE_CGROUP_DEVICE
program loading, it doesn't support a bunch of newer BPF_PROG_LOAD
fields and functionality, etc, etc.

So as a technical curiosity it's pretty cool and perhaps is the right
tool for the job for very narrow specific use cases. But as a
realistic generic approach that could be used by industry at large for
safe BPF usage from namespaced containers -- not so much.


  [0] https://brauner.io/2020/08/07/seccomp-notify-intercepting-the-bpf-syscall.html
  [1] https://github.com/lxc/lxd/commit/566d0a3b3cbe288787886c2f3bf5b250ceb930b0


> -Toke
>

Yafang Shao July 1, 2023, 2:05 a.m. UTC | #3

Hi Andrii,

Thanks for your proposal.
That seems to be a useful functionality, and yet I have some questions.

1. Why can't we add security_bpf_probe_read_{kernel,user}?
   If possible, we can use these LSM hooks to prevent the process from
reading other tasks' information. E.g. if the other process is not within
the same cgroup or the same namespace, we just refuse the read. I
think it is not hard to identify whether the other process is within the
same cgroup or the same namespace.

2. Why can't we extend bpf_cookie?
   We're now using bpf_cookie to identify each user or each
application, and only the permitted cookies can create new probe
links.  However, we find that bpf_cookie is only supported by tracing,
perf_event and kprobe_multi, so we're planning to extend it to other
possible link types, so that we can use LSM hooks to control all bpf
links.  I think that the upstream kernel should also support
bpf_cookie for all bpf links. If possible, we will post it
upstream in the future.
   After reading your BPF token proposal, I have another
idea. Why can't we just extend bpf_cookie to all other BPF objects?
For example, all progs and maps should also have the bpf_cookie.

Djalal Harouni July 2, 2023, 6:59 a.m. UTC | #4
On Fri, Jun 30, 2023 at 1:16 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii@kernel.org> writes:
>
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token, as different production setups can and do achieve it
> > through a combination of different means (signing, LSM, code reviews, etc),
> > and it's undesirable and infeasible for kernel to enforce any particular way
> > of validating trustworthiness of particular process.
> >
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
>
> So a colleague pointed out today that the Seccomp Notify functionality
> would be a way to achieve your stated goal of allowing unprivileged
> containers to (selectively) perform bpf() syscall operations. Christian
> Brauner has a pretty nice writeup of the functionality here:
> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
>
> In fact he even mentions allowing unprivileged access to bpf() as a
> possible use case (in the second-to-last paragraph).
>
> AFAICT this would enable your use case without adding any new kernel
> functionality or changing the BPF-using applications, while allowing the
> privileged userspace daemon to make case-by-case decisions on each
> operation instead of granting blanket capabilities (which is my main
> objection to the token proposal, as we discussed on the last iteration
> of the series).
>
> So I'm curious whether you considered this as an alternative to
> BPF_TOKEN? And if so, what your reason was for rejecting it?

The Seccomp notifier 1. is an answer to special device nodes (or
arguably to simple cases...), 2. is a quick solution that avoids changing
infrastructure and how the kernel deals with device nodes (it doesn't
solve the root problem, which this BPF series at least tries to do...),
and 3. relies on Seccomp and would inherit the same limitations.

It clashes with BPF! BPF is not mknod, and most of its use cases are
*transparent to the workload*, they can't use Seccomp and are not
interested in it... Fd delegation is good design and applies to *all*
BPF use cases, all tools can take advantage of it, it is not
restricted to a special tool or daemon X.

Going further, hiding behind Seccomp notifier and such prevents BPF
from solving current and future problems.

Christian Brauner July 4, 2023, 9:38 a.m. UTC | #5
On Fri, Jun 30, 2023 at 11:25:57AM -0700, Andrii Nakryiko wrote:
> On Thu, Jun 29, 2023 at 4:15 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Andrii Nakryiko <andrii@kernel.org> writes:
> >
> > > This patch set introduces new BPF object, BPF token, which allows to delegate
> > > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > > systemd or any other container manager) to a *trusted* unprivileged
> > > application. Trust is the key here. This functionality is not about allowing
> > > unconditional unprivileged BPF usage. Establishing trust, though, is
> > > completely up to the discretion of respective privileged application that
> > > would create a BPF token, as different production setups can and do achieve it
> > > through a combination of different means (signing, LSM, code reviews, etc),
> > > and it's undesirable and infeasible for kernel to enforce any particular way
> > > of validating trustworthiness of particular process.
> > >
> > > The main motivation for BPF token is a desire to enable containerized
> > > BPF applications to be used together with user namespaces. This is currently
> > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > > arbitrary memory, and it's impossible to ensure that they only read memory of
> > > processes belonging to any given namespace. This means that it's impossible to
> > > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > > no assumption about what "trusted" constitutes in any particular case, and
> > > it's up to specific privileged applications and their surrounding
> > > infrastructure to decide that. What kernel provides is a set of APIs to create
> > > and tune BPF token, and pass it around to privileged BPF commands that are
> > > creating new BPF objects like BPF programs, BPF maps, etc.
> >
> > So a colleague pointed out today that the Seccomp Notify functionality
> > would be a way to achieve your stated goal of allowing unprivileged
> > containers to (selectively) perform bpf() syscall operations. Christian
> > Brauner has a pretty nice writeup of the functionality here:
> > https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
> >
> > In fact he even mentions allowing unprivileged access to bpf() as a
> > possible use case (in the second-to-last paragraph).
> >
> > AFAICT this would enable your use case without adding any new kernel
> > functionality or changing the BPF-using applications, while allowing the
> > privileged userspace daemon to make case-by-case decisions on each
> > operation instead of granting blanket capabilities (which is my main
> > objection to the token proposal, as we discussed on the last iteration
> > of the series).
> 
> It's not "blanket" capabilities. You control types of maps and
> programs that could be created. And again, CAP_SYS_ADMIN guarded.
> Please, don't give CAP_SYS_ADMIN/root permissions to applications you
> can't be sure won't do something stupid and blame kernel API for it.
> 
> After all, the root process can setuid() any file and make it run with
> elevated permissions, right? Doesn't get more "blanket" than that.
> 
> >
> > So I'm curious whether you considered this as an alternative to
> > BPF_TOKEN? And if so, what your reason was for rejecting it?
> >
> 
> Yes, I'm aware, Christian has a follow up short blog post specifically
> for using this for proxying BPF from privileged process ([0]).
> 
> So, in short, I think it's not a good generic solution. It's very
> fragile and high-maintenance. It's still proxying BPF UAPI (except
> application does preserve illusion of using BPF syscall, yes, that
> part is good) with all the implications: needing to replicate all of
> UAPI (fetching all those FDs from another process, following all the
> pointers from another process' memory, etc), and also writing back all
> the correct things (into another process' memory): log content,
> log_true_size (out param), any other output parameters. What do we do
> when an application uses a newer version of bpf_attr than is supported
> by the proxy? And honestly, I'm like 99% sure there are lots of less
> obvious issues one runs into when starting implementing something like
> this.
> 
> This sounds like a hack and nightmare to implement and support.
> Perhaps that indirectly is supported by the fact that even Christian
> half-jokingly calls this a crazy approach. That code basically is
> unchanged for the last three years, with only one fix from Christian
> one year after initial introduction ([1]) to fix a quirky issue
> related to the limitation of pidfd working only for thread group
> leaders. It also still supports only BPF_PROG_TYPE_CGROUP_DEVICE
> program loading, it doesn't support a bunch of newer BPF_PROG_LOAD
> fields and functionality, etc, etc.
> 
> So as a technical curiosity it's pretty cool and perhaps is the right
> tool for the job for very narrow specific use cases. But as a
> realistic generic approach that could be used by industry at large for
> safe BPF usage from namespaced containers -- not so much.

Some background... When BPF & cgroup moved the devices cgroup from a
file-based cgroup controller into a BPF program it was technically an
immediate widespread regression.

The cgroup v1 controller was file based and supported seamlessly
switching between allow- and denylists. Whether that was ever sensible
is a separate question.

But what this meant was that any container runtime that used a simple
file-based mechanism now had to generate a BPF device program that
mirrored the cgroup v1 semantic such that the old syntax of the cgroup
v1 device controller would be correctly translated into a BPF devices
program.
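
To make the translation burden concrete, here is a minimal sketch of the
first half of that job: parsing one cgroup v1 devices rule of the form
"type major:minor access" (e.g. "c 1:3 rwm") and checking an access
request against it. All names below (dev_rule, parse_rule, rule_allows,
the ACC_* bits) are hypothetical and only for illustration; a real
runtime would additionally have to emit an equivalent BPF devices program
and reproduce the allow-/denylist switching, which this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Access bits for this sketch only (hypothetical names). */
enum { ACC_MKNOD = 1 << 0, ACC_READ = 1 << 1, ACC_WRITE = 1 << 2 };

#define DEV_ANY (-1L) /* stands for the '*' wildcard */

struct dev_rule {
	char type;           /* 'a' (all), 'b' (block) or 'c' (char) */
	long major;          /* device major, or DEV_ANY */
	long minor;          /* device minor, or DEV_ANY */
	unsigned int access; /* bitmask of ACC_* */
};

/* Parse a decimal device number or the '*' wildcard. */
static bool parse_devnum(const char *s, long *out)
{
	char *end;

	if (strcmp(s, "*") == 0) {
		*out = DEV_ANY;
		return true;
	}
	*out = strtol(s, &end, 10);
	return end != s && *end == '\0' && *out >= 0;
}

/* Parse "type major:minor access"; returns false on malformed input. */
static bool parse_rule(const char *line, struct dev_rule *r)
{
	char type, major[16], minor[16], acc[4];

	assert(line != NULL && r != NULL);
	if (sscanf(line, " %c %15[^:]:%15s %3s", &type, major, minor, acc) != 4)
		return false;
	if (type != 'a' && type != 'b' && type != 'c')
		return false;
	if (!parse_devnum(major, &r->major) || !parse_devnum(minor, &r->minor))
		return false;
	r->type = type;
	r->access = 0;
	for (const char *p = acc; *p; p++) {
		switch (*p) {
		case 'r': r->access |= ACC_READ; break;
		case 'w': r->access |= ACC_WRITE; break;
		case 'm': r->access |= ACC_MKNOD; break;
		default: return false;
		}
	}
	return true;
}

/* True if the rule covers the device and every requested access bit. */
static bool rule_allows(const struct dev_rule *r, char type,
			long major, long minor, unsigned int access)
{
	if (r->type != 'a' && r->type != type)
		return false;
	if (r->major != DEV_ANY && r->major != major)
		return false;
	if (r->minor != DEV_ANY && r->minor != minor)
		return false;
	return (access & ~r->access) == 0;
}
```

The matching logic in rule_allows() is what the generated BPF devices
program would have to encode for every rule, in order.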

In addition, this broke some nesting scenarios. So intercepting bpf()
via seccomp was specifically done to avoid devices cgroup regressions.
It was never meant to be a generic solution.

It also doesn't work for all cases as the seccomp notifier's supervision
mechanism isn't really a clean solution.

It's a pipe dream that you can transparently proxy system calls for
another process via seccomp for sufficiently complex system calls. We
did it for specific use-cases where we could sufficiently guarantee that
they could be safe. But to make this work it would involve way more
invasive changes:

* nesting/stacking of seccomp notifiers
* clean handling of pointer arguments in-kernel such that you can safely
  continue system calls being sure that they haven't been modified. This
  is currently only possible in scenarios where safety is guaranteed by
  the kernel refusing nonsensical or unsafe arguments
* correct privilege handling
  The seccomp notifier emulates system calls in userspace and thus has
  to mimic the privilege context of the task it is emulating the system
  call for in such a way that (i) it allows it to succeed by avoiding the
  privilege limitations that caused the given system call to be
  proxied in the first place, (ii) it doesn't allow circumventing other,
  generic restrictions that would otherwise cause the system call to
  fail. It's like saying e.g., "execute with most of the proxied task's
  creds but let it have a few more privileges". That's frail as Linux
  creds aren't really composable. That's why we have override_creds()
  not "add_creds()" and "subtract_creds()" which would probably be
  nicer.

Or it would have to be a generic first class kernel proxy which begs the
question why not change the subsystem itself to do this cleanly.
Christian Brauner July 4, 2023, 9:51 a.m. UTC | #6
On Fri, Jun 30, 2023 at 01:15:47AM +0200, Toke Høiland-Jørgensen wrote:
> Andrii Nakryiko <andrii@kernel.org> writes:
> 
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token, as different production setups can and do achieve it
> > through a combination of different means (signing, LSM, code reviews, etc),
> > and it's undesirable and infeasible for kernel to enforce any particular way
> > of validating trustworthiness of particular process.
> >
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
> 
> So a colleague pointed out today that the Seccomp Notify functionality
> would be a way to achieve your stated goal of allowing unprivileged
> containers to (selectively) perform bpf() syscall operations. Christian
> Brauner has a pretty nice writeup of the functionality here:
> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development

I'm amazed you read this. :)
The seccomp notifier comes with a lot of caveats. I think it would be
impractical if not infeasible to handle bpf() delegation.

> 
> In fact he even mentions allowing unprivileged access to bpf() as a
> possible use case (in the second-to-last paragraph).

Yeah, I tried to work around a userspace regression with the
introduction of the cgroup v2 devices controller.
Toke Høiland-Jørgensen July 4, 2023, 11:20 p.m. UTC | #7
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Thu, Jun 29, 2023 at 4:15 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Andrii Nakryiko <andrii@kernel.org> writes:
>>
>> > This patch set introduces new BPF object, BPF token, which allows to delegate
>> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
>> > systemd or any other container manager) to a *trusted* unprivileged
>> > application. Trust is the key here. This functionality is not about allowing
>> > unconditional unprivileged BPF usage. Establishing trust, though, is
>> > completely up to the discretion of respective privileged application that
>> > would create a BPF token, as different production setups can and do achieve it
>> > through a combination of different means (signing, LSM, code reviews, etc),
>> > and it's undesirable and infeasible for kernel to enforce any particular way
>> > of validating trustworthiness of particular process.
>> >
>> > The main motivation for BPF token is a desire to enable containerized
>> > BPF applications to be used together with user namespaces. This is currently
>> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
>> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
>> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
>> > arbitrary memory, and it's impossible to ensure that they only read memory of
>> > processes belonging to any given namespace. This means that it's impossible to
>> > have namespace-aware CAP_BPF capability, and as such another mechanism to
>> > allow safe usage of BPF functionality is necessary. BPF token and delegation
>> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
>> > no assumption about what "trusted" constitutes in any particular case, and
>> > it's up to specific privileged applications and their surrounding
>> > infrastructure to decide that. What kernel provides is a set of APIs to create
>> > and tune BPF token, and pass it around to privileged BPF commands that are
>> > creating new BPF objects like BPF programs, BPF maps, etc.
>>
>> So a colleague pointed out today that the Seccomp Notify functionality
>> would be a way to achieve your stated goal of allowing unprivileged
>> containers to (selectively) perform bpf() syscall operations. Christian
>> Brauner has a pretty nice writeup of the functionality here:
>> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
>>
>> In fact he even mentions allowing unprivileged access to bpf() as a
>> possible use case (in the second-to-last paragraph).
>>
>> AFAICT this would enable your use case without adding any new kernel
>> functionality or changing the BPF-using applications, while allowing the
>> privileged userspace daemon to make case-by-case decisions on each
>> operation instead of granting blanket capabilities (which is my main
>> objection to the token proposal, as we discussed on the last iteration
>> of the series).
>
> It's not "blanket" capabilities. You control types of maps and
> programs that could be created. And again, CAP_SYS_ADMIN guarded.
> Please, don't give CAP_SYS_ADMIN/root permissions to applications you
> can't be sure won't do something stupid and blame kernel API for it.

Right, I didn't mean "blanket" in the sense of "permission to do
anything on the system"; I do get that you can restrict which subset of
functionality you grant. However, *within* that subset, it's a blanket
permission grant. I.e., you can't issue a token that grants a *specific*
application permission to load a *specific* BPF program - you can only
grant a general "load any program" permission that can be used by anyone
who possesses the token.

I guess we could in principle extend the token mechanism to allow this,
but the kernel doesn't seem like the right place to implement such a
fine-grained policy engine...

> After all, the root process can setuid() any file and make it run with
> elevated permissions, right? Doesn't get more "blanket" than that.

Which is exactly why setuid binaries are not generally how we implement
security delegation these days. So I don't think designing a new
mechanism this way is a good idea.

>> So I'm curious whether you considered this as an alternative to
>> BPF_TOKEN? And if so, what your reason was for rejecting it?
>>
>
> Yes, I'm aware, Christian has a follow up short blog post specifically
> for using this for proxying BPF from privileged process ([0]).
>
> So, in short, I think it's not a good generic solution. It's very
> fragile and high-maintenance. It's still proxying BPF UAPI (except
> application does preserve illusion of using BPF syscall, yes, that
> part is good) with all the implications: needing to replicate all of
> UAPI (fetching all those FDs from another process, following all the
> pointers from another process' memory, etc), and also writing back all
> the correct things (into another process' memory): log content,
> log_true_size (out param), any other output parameters.

Right, OK, that bit does sound pretty tedious (although I'll note that
there are people who are trying to make all this generally more
palatable[0]).

However, all that tediousness could be avoided while still retaining the
model of blocking the syscall and asking a userspace policy daemon to
supply a verdict. This could even be done using the same token
mechanism: instead of attaching a permission to the token itself, just
make it an opaque identifier. Then, when a syscall is made that contains
the token, block it and send a notification to user space and use the
verdict that comes back in place of the token "value". The notification
could go through the same file descriptor (using read/write or an ioctl,
restricted to CAP_SYS_ADMIN), or it could be a separate one that is
returned alongside it on TOKEN_CREATE. The notification could include
all of the syscall args or a subset, depending on the command, but the
kernel can ensure there are no TOCTOU races, and there is no need for the
policy daemon to go poking into another process' namespace.
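
As a rough sketch of what the daemon side of such a verdict model could
look like: the kernel hands over a snapshot of the relevant arguments
together with the opaque token identifier, and the daemon answers allow or
deny. Everything below is an assumption made for illustration; neither the
token_notif layout, the command names, nor decide() exist in the kernel or
in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TAG_LEN 8

/* Hypothetical command and verdict values, for this sketch only. */
enum hyp_cmd { HYP_CMD_MAP_CREATE, HYP_CMD_PROG_LOAD };
enum verdict { VERDICT_DENY, VERDICT_ALLOW };

/*
 * Hypothetical notification: a kernel-made snapshot of the call, so the
 * daemon never has to read the caller's memory.
 */
struct token_notif {
	uint64_t token_id;         /* opaque identifier, carries no rights */
	enum hyp_cmd cmd;          /* which bpf() operation was attempted */
	uint8_t prog_tag[TAG_LEN]; /* identifies the specific program */
};

/* One allowlist entry: this token may run this command on this program. */
struct policy_rule {
	uint64_t token_id;
	enum hyp_cmd cmd;
	uint8_t prog_tag[TAG_LEN];
};

/* Default-deny: allow only if some rule matches the request exactly. */
static enum verdict decide(const struct policy_rule *rules, size_t n,
			   const struct token_notif *req)
{
	assert(rules != NULL || n == 0);
	assert(req != NULL);

	for (size_t i = 0; i < n; i++) {
		if (rules[i].token_id == req->token_id &&
		    rules[i].cmd == req->cmd &&
		    memcmp(rules[i].prog_tag, req->prog_tag, TAG_LEN) == 0)
			return VERDICT_ALLOW;
	}
	return VERDICT_DENY;
}
```

The point of the sketch is that the per-program decision lives entirely in
the daemon's rule table, while the kernel only transports the request and
the verdict.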

Actually, using this model I don't think we would even strictly speaking
need the explicit token FD to be included by the calling application
inside the container at all? I.e., if the system policy daemon could
just instruct the kernel "please delegate all permission decisions for
this user namespace to me", it could - so to speak - issue tokens on
demand as each call is made, instead of ahead of time. Which would both
enable the policy daemon to make specific usage decisions, and wouldn't
require any changes to the applications using BPF inside the
container (not even to include the BPF token FD).

-Toke
Toke Høiland-Jørgensen July 4, 2023, 11:33 p.m. UTC | #8
Christian Brauner <brauner@kernel.org> writes:

> On Fri, Jun 30, 2023 at 01:15:47AM +0200, Toke Høiland-Jørgensen wrote:
>> Andrii Nakryiko <andrii@kernel.org> writes:
>> 
>> > This patch set introduces new BPF object, BPF token, which allows to delegate
>> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
>> > systemd or any other container manager) to a *trusted* unprivileged
>> > application. Trust is the key here. This functionality is not about allowing
>> > unconditional unprivileged BPF usage. Establishing trust, though, is
>> > completely up to the discretion of respective privileged application that
>> > would create a BPF token, as different production setups can and do achieve it
>> > through a combination of different means (signing, LSM, code reviews, etc),
>> > and it's undesirable and infeasible for kernel to enforce any particular way
>> > of validating trustworthiness of particular process.
>> >
>> > The main motivation for BPF token is a desire to enable containerized
>> > BPF applications to be used together with user namespaces. This is currently
>> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
>> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
>> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
>> > arbitrary memory, and it's impossible to ensure that they only read memory of
>> > processes belonging to any given namespace. This means that it's impossible to
>> > have namespace-aware CAP_BPF capability, and as such another mechanism to
>> > allow safe usage of BPF functionality is necessary. BPF token and delegation
>> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
>> > no assumption about what "trusted" constitutes in any particular case, and
>> > it's up to specific privileged applications and their surrounding
>> > infrastructure to decide that. What kernel provides is a set of APIs to create
>> > and tune BPF token, and pass it around to privileged BPF commands that are
>> > creating new BPF objects like BPF programs, BPF maps, etc.
>> 
>> So a colleague pointed out today that the Seccomp Notify functionality
>> would be a way to achieve your stated goal of allowing unprivileged
>> containers to (selectively) perform bpf() syscall operations. Christian
>> Brauner has a pretty nice writeup of the functionality here:
>> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
>
> I'm amazed you read this. :)

I found it quite an enjoyable read, actually :)

> The seccomp notifier comes with a lot of caveats. I think it would be
> impractical if not infeasible to handle bpf() delegation.

Right, thank you for chiming in and explaining the context. I replied
elsewhere in the thread on the content, so let's not fork the discussion
any more than we have to...

-Toke
Stefano Brivio July 5, 2023, 12:57 p.m. UTC | #9
On Wed, 05 Jul 2023 01:20:22 +0200
Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> 
> > On Thu, Jun 29, 2023 at 4:15 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:  
> >>
> >> Andrii Nakryiko <andrii@kernel.org> writes:
> >>  
> >> > This patch set introduces new BPF object, BPF token, which allows to delegate
> >> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> >> > systemd or any other container manager) to a *trusted* unprivileged
> >> > application. Trust is the key here. This functionality is not about allowing
> >> > unconditional unprivileged BPF usage. Establishing trust, though, is
> >> > completely up to the discretion of respective privileged application that
> >> > would create a BPF token, as different production setups can and do achieve it
> >> > through a combination of different means (signing, LSM, code reviews, etc),
> >> > and it's undesirable and infeasible for kernel to enforce any particular way
> >> > of validating trustworthiness of particular process.
> >> >
> >> > The main motivation for BPF token is a desire to enable containerized
> >> > BPF applications to be used together with user namespaces. This is currently
> >> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> >> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> >> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> >> > arbitrary memory, and it's impossible to ensure that they only read memory of
> >> > processes belonging to any given namespace. This means that it's impossible to
> >> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> >> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> >> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> >> > no assumption about what "trusted" constitutes in any particular case, and
> >> > it's up to specific privileged applications and their surrounding
> >> > infrastructure to decide that. What kernel provides is a set of APIs to create
> >> > and tune BPF token, and pass it around to privileged BPF commands that are
> >> > creating new BPF objects like BPF programs, BPF maps, etc.  
> >>
> >> So a colleague pointed out today that the Seccomp Notify functionality
> >> would be a way to achieve your stated goal of allowing unprivileged
> >> containers to (selectively) perform bpf() syscall operations. Christian
> >> Brauner has a pretty nice writeup of the functionality here:
> >> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
> >>
> >> In fact he even mentions allowing unprivileged access to bpf() as a
> >> possible use case (in the second-to-last paragraph).
> >>
> >> AFAICT this would enable your use case without adding any new kernel
> >> functionality or changing the BPF-using applications, while allowing the
> >> privileged userspace daemon to make case-by-case decisions on each
> >> operation instead of granting blanket capabilities (which is my main
> >> objection to the token proposal, as we discussed on the last iteration
> >> of the series).  
> >
> > It's not "blanket" capabilities. You control types of maps and
> > programs that could be created. And again, CAP_SYS_ADMIN guarded.
> > Please, don't give CAP_SYS_ADMIN/root permissions to applications you
> > can't be sure won't do something stupid and blame kernel API for it.  
> 
> Right, I didn't mean "blanket" in the sense of "permission to do
> anything on the system"; I do get that you can restrict which subset of
> functionality you grant. However, *within* that subset, it's a blanket
> permission grant. I.e., you can't issue a token that grants a *specific*
> application permission to load a *specific* BPF program - you can only
> grant a general "load any program" permission that can be used by anyone
> who possesses the token.
> 
> I guess we could in principle extend the token mechanism to allow this,
> but the kernel doesn't seem like the right place to implement such a
> fine-grained policy engine...
> 
> > After all, the root process can setuid() any file and make it run with
> > elevated permissions, right? Doesn't get more "blanket" than that.  
> 
> Which is exactly why setuid binaries are not generally how we implement
> security delegation these days. So I don't think designing a new
> mechanism this way is a good idea.
> 
> >> So I'm curious whether you considered this as an alternative to
> >> BPF_TOKEN? And if so, what your reason was for rejecting it?
> >>  
> >
> > Yes, I'm aware, Christian has a follow up short blog post specifically
> > for using this for proxying BPF from privileged process ([0]).
> >
> > So, in short, I think it's not a good generic solution. It's very
> > fragile and high-maintenance. It's still proxying BPF UAPI (except
> > application does preserve illusion of using BPF syscall, yes, that
> > part is good) with all the implications: needing to replicate all of
> > UAPI (fetching all those FDs from another process, following all the
> > pointers from another process' memory, etc), and also writing back all
> > the correct things (into another process' memory): log content,
> > log_true_size (out param), any other output parameters.  
> 
> Right, OK, that bit does sound pretty tedious (although I'll note that
> there are people who are trying to make all this generally more
> palatable[0]).

[0] https://seitan.rocks/ :)

Some clickbaiting for Christian: the presentation we gave a couple of
weeks ago, also linked from the project website, actually credits you
(slide 29/30, of course).

The code is still very much draft quality (we mostly focused on
demos/feasibility so far, cleaning it up now), and we didn't prove (at
least not yet) that handling complicated stuff such as bpf(2) is
actually convenient, but that's at least in scope as a stretch goal.
I'm not claiming it's doable, but we'd give it a try.

What we have at the moment is a meagre set of eight syscall models,
some blatantly incomplete.

A couple of comments to specific points Christian mentioned:

On Tue, 4 Jul 2023 11:38:38 +0200
Christian Brauner <brauner@kernel.org> wrote:

> It's a pipe dream that you can transparently proxy system calls for
> another process via seccomp for sufficiently complex system calls. We
> did it for specific use-cases where we could sufficiently guarantee that
> they could be safe.

Right, so we're trying to pick it up from there. It's way too early to
claim success, but I thought it would make sense to chime in anyway.

> But to make this work it would involve way more invasive changes:
> 
> * nesting/stacking of seccomp notifiers

The need for stacked seccomp filters is obvious to me and that works more
or less naturally. But why would you actually need to stack, or especially
nest *notifiers* themselves?

> * clean handling of pointer arguments in-kernel such that you can safely
>   continue system calls being sure that they haven't been modified. This
>   is currently only possible in scenarios where safety is guaranteed by
>   the kernel refusing nonsensical or unsafe arguments

We're considering a couple of options. One is to never use
SECCOMP_USER_NOTIF_FLAG_CONTINUE for system calls accepting pointers, or
to only allow that as an explicit "unsafe" option. For a "safe"
implementation, the supervisor (seitan) would in any case replay the
system call, matching the context (namespaces, credentials) of the target
process.
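
A minimal sketch of that first option, assuming a small table that records
whether a modelled system call takes pointer arguments. The table, the
action enum and both helpers are hypothetical names for illustration and
are not seitan's actual code; only struct seccomp_notif_resp and
SECCOMP_USER_NOTIF_FLAG_CONTINUE come from <linux/seccomp.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

/* Fallback for older UAPI headers that predate the flag. */
#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Hypothetical model entry: does any argument carry a user pointer? */
struct syscall_model {
	long nr;
	bool has_pointer_args;
};

static const struct syscall_model models[] = {
	{ SYS_bpf,    true  }, /* bpf(2): attr points into caller memory */
	{ SYS_getpid, false },
};

enum action { ACT_CONTINUE, ACT_REPLAY, ACT_DENY };

/*
 * Pointer-taking calls are never continued unless the operator opted into
 * the "unsafe" mode; they are replayed by the supervisor instead.
 * Unmodelled calls are denied.
 */
static enum action choose_action(long nr, bool allow_unsafe)
{
	for (size_t i = 0; i < sizeof(models) / sizeof(models[0]); i++) {
		if (models[i].nr != nr)
			continue;
		if (!models[i].has_pointer_args || allow_unsafe)
			return ACT_CONTINUE;
		return ACT_REPLAY;
	}
	return ACT_DENY;
}

/*
 * Fill the response for the notification with the given id. replay_ret is
 * the result of the supervisor's own replay: >= 0 is a return value, < 0
 * is a negative errno. It is ignored unless act is ACT_REPLAY.
 */
static void fill_response(struct seccomp_notif_resp *resp, __u64 id,
			  enum action act, long replay_ret)
{
	assert(resp != NULL);
	memset(resp, 0, sizeof(*resp));
	resp->id = id;

	switch (act) {
	case ACT_CONTINUE:
		/* val and error must stay 0 when asking to continue. */
		resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
		break;
	case ACT_REPLAY:
		if (replay_ret < 0)
			resp->error = (int)replay_ret;
		else
			resp->val = replay_ret;
		break;
	case ACT_DENY:
		resp->error = -EPERM;
		break;
	}
}
```

The filled response would then be handed back with the
SECCOMP_IOCTL_NOTIF_SEND ioctl on the notification file descriptor; the
replay itself (entering the target's namespaces, matching credentials) is
the hard part and is not shown.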

If PID or TID (per se, not in terms of associated context/capabilities) of
the caller matter for a specific system call, though, we simply can't
support that. But that shouldn't actually be relevant for bpf(2).

Strictly speaking, I think it's actually possible to "fix" this in the
kernel by means of checking or copying memory that's addressable by a
thread, but that might prove too invasive or end up in insurmountable
layering violations. This mechanism would involve "control" paths
rather than data paths, though, so the performance impact is not really
worrying.

Another option, which we outlined at this very convenient link:
  https://github.com/alicefr/community/blob/seitan/design-proposals/seitan/security-aspects-seitan.md#if-i-use-the-json-model-as-a-security-filter-can-another-thread-in-the-same-process-context-write-to-the-memory-area-pointed-to-by-system-call-arguments-while-the-calling-thread-is-blocked-and-defy-the-purpose-of-the-filter

would be to make the supervisor perform a deep copy (system calls are
anyway modeled in the seitan-cooker component) and then use good old
ptrace(2) as needed.

> * correct privilege handling
>   The seccomp notifier emulates system calls in userspace and thus has
>   to mimic the privilege context of the task it is emulating the system
>   call for in such a way that (i) it allows it to succeed by avoiding the
>   privilege limitations that caused the given system call to be
>   proxied in the first place, (ii) it doesn't allow circumventing other,
>   generic restrictions that would otherwise cause the system call to
>   fail. It's like saying e.g., "execute with most of the proxied task's
>   creds but let it have a few more privileges". That's frail as Linux
>   creds aren't really composable. That's why we have override_creds()
>   not "add_creds()" and "subtract_creds()" which would probably be
>   nicer.

Right, at the moment we just run that as root, but we plan to take care
of (ii) (albeit not solving it entirely, I guess), by at least applying a
seccomp filter to the supervisor itself. As to the set of (composed?)
capabilities, we don't have an answer yet.

> Or it would have to be a generic first class kernel proxy which begs the
> question why not change the subsystem itself to do this cleanly.

Well, the fine-grained "policy" implementation we're trying to achieve
looks to me like something that's a bit too complicated for the kernel,
and really more appropriate for userspace.
Andrii Nakryiko July 5, 2023, 8:37 p.m. UTC | #10
On Fri, Jun 30, 2023 at 7:06 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> Hi Andrii,
>
> Thanks for your proposal.
> That seems to be a useful functionality, and yet I have some questions.

I've answered them below. But I don't think either of them has any
relation to BPF token or the problem I'm trying to solve.

>
> 1. Why can't we add security_bpf_probe_read_{kernel,user}?
>     If possible, we can use these LSM hooks to refuse the process to
> read other tasks' information. E.g. if the other process is not within
> the same cgroup or the same namespace, we just refuse the reading. I
> think it is not hard to identify if the other process is within the
> same cgroup or the same namespace.

There are probably many reasons. First, performance-wise, an LSM hook on
each bpf_probe_read_{kernel,user}() call would be prohibitively expensive.
And in general, one would need to be very careful with such LSM hooks,
because bpf_probe_read_{kernel,user}() is often called from NMI context,
so any LSM policy would have to be written and validated very carefully
with NMI context in mind.

But, more conceptually, for probe_read you get a random address and
you know the process context you are running in (but you might
actually be running in softirq or NMI, in which case that process
context is irrelevant). How can you efficiently (or at all) tell whether
that random address "belongs" to a cgroup or namespace, even just at a
conceptual level?

>
> 2. Why can't we extend bpf_cookie?
>    We're now using bpf_cookie to identify each user or each
> application, and only the permitted cookies can create new probe
> links.  However we find the bpf_cookie is only supported by tracing,
> perf_event and kprobe_multi, so we're planning to extend it to other
> possible link types, then we can use LSM hooks to control all bpf
> links.  I think that the upstream kernel should also support
> bpf_cookie for all bpf links. If possible, we will post it to the
> upstream in the future.
>    After I have read your BPF token proposal, I just have some other
> ideas. Why can't we just extend bpf_cookie to all other BPF objects?
> For example, all progs and maps should also have the bpf_cookie.
>

I'm not exactly clear how you use BPF cookie, but it wasn't intended
to provide any sort of security or validation policy. It's purely a
user-provided u64 to help distinguish different attach points when the
same BPF program is attached in multiple places (e.g., kprobe tracing
many different kernel functions and needing to distinguish between
them at runtime).
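
To make the intended use concrete, here is a minimal userspace model of it
(a sketch only, not kernel or libbpf code): one handler is shared by several
attach points, and the u64 cookie supplied at attach time is the only thing
that tells them apart. In a real BPF program the value would be fetched with
the bpf_get_attach_cookie() helper; the struct, enum and function names
below are hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an attach point: the traced function name is
 * illustrative, the cookie is the user-chosen u64 given at attach time. */
struct attach_point {
    const char *func_name;
    uint64_t cookie;
};

/* Cookie values are arbitrary; the user picks them when attaching. */
enum { COOKIE_OPEN = 1, COOKIE_READ = 2, COOKIE_WRITE = 3 };

struct counters {
    unsigned long opens, reads, writes, unknown;
};

/* One handler shared by every attach point. It never looks at the
 * function name, only at the cookie it was attached with. */
static void shared_handler(uint64_t cookie, struct counters *c)
{
    switch (cookie) {
    case COOKIE_OPEN:
        c->opens++;
        break;
    case COOKIE_READ:
        c->reads++;
        break;
    case COOKIE_WRITE:
        c->writes++;
        break;
    default:
        c->unknown++;
        break;
    }
}

/* Simulate each attach point firing once. */
static void fire_all(const struct attach_point *pts, size_t n,
                     struct counters *c)
{
    for (size_t i = 0; i < n; i++)
        shared_handler(pts[i].cookie, c);
}
```

Note that nothing here checks who supplied the cookie or whether they were
allowed to; it is a label for telling attach points apart, not a credential.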

I do agree BPF cookie is super useful and we should keep extending
other types of BPF programs with BPF cookie support, of course. It's
just completely orthogonal to the BPF token discussion.


>
> --
> Regards
> Yafang
Andrii Nakryiko July 5, 2023, 8:39 p.m. UTC | #11
On Tue, Jul 4, 2023 at 2:52 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Fri, Jun 30, 2023 at 01:15:47AM +0200, Toke Høiland-Jørgensen wrote:
> > So a colleague pointed out today that the Seccomp Notify functionality
> > would be a way to achieve your stated goal of allowing unprivileged
> > containers to (selectively) perform bpf() syscall operations. Christian
> > Brauner has a pretty nice writeup of the functionality here:
> > https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
>
> I'm amazed you read this. :)
> The seccomp notifier comes with a lot of caveats. I think it would be
> impractical if not infeasible to handle bpf() delegation.

Thanks for confirming my hunch.

And yeah, I read a bunch of posts from your blog. The one about the
new mount APIs was especially useful, given how little documentation
I could find on them otherwise :)

>
> >
> > In fact he even mentions allowing unprivileged access to bpf() as a
> > possible use case (in the second-to-last paragraph).
>
> Yeah, I tried to work around a userspace regression with the
> introduction of the cgroup v2 devices controller.
Yafang Shao July 6, 2023, 1:26 a.m. UTC | #12
On Thu, Jul 6, 2023 at 4:37 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Jun 30, 2023 at 7:06 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > Hi Andrii,
> >
> > Thanks for your proposal.
> > That seems to be a useful functionality, and yet I have some questions.
>
> I've answered them below. But I don't think either of them has any
> relation to BPF token or the problem I'm trying to solve.
>
> >
> > 1. Why can't we add security_bpf_probe_read_{kernel,user}?
> >     If possible, we can use these LSM hooks to refuse the process to
> > read other tasks' information. E.g. if the other process is not within
> > the same cgroup or the same namespace, we just refuse the reading. I
> > think it is not hard to identify if the other process is within the
> > same cgroup or the same namespace.
>
> There are probably many reasons. First, performance-wise, an LSM hook on
> each bpf_probe_read_{kernel,user}() call would be prohibitively expensive.
> And in general, one would need to be very careful with such LSM hooks,
> because bpf_probe_read_{kernel,user}() is often called from NMI context,
> so any LSM policy would have to be written and validated very carefully
> with NMI context in mind.
>
> But, more conceptually, for probe_read you get a random address and
> you know the process context you are running in (but you might
> actually be running in softirq or NMI, in which case that process
> context is irrelevant). How can you efficiently (or at all) tell whether
> that random address "belongs" to a cgroup or namespace, even just at a
> conceptual level?
>
> >
> > 2. Why can't we extend bpf_cookie?
> >    We're now using bpf_cookie to identify each user or each
> > application, and only the permitted cookies can create new probe
> > links.  However we find the bpf_cookie is only supported by tracing,
> > perf_event and kprobe_multi, so we're planning to extend it to other
> > possible link types, then we can use LSM hooks to control all bpf
> > links.  I think that the upstream kernel should also support
> > bpf_cookie for all bpf links. If possible, we will post it to the
> > upstream in the future.
> >    After I have read your BPF token proposal, I just have some other
> > ideas. Why can't we just extend bpf_cookie to all other BPF objects?
> > For example, all progs and maps should also have the bpf_cookie.
> >
>
> I'm not exactly clear how you use BPF cookie, but it wasn't intended
> to provide any sort of security or validation policy. It's purely a
> user-provided u64 to help distinguish different attach points when the
> same BPF program is attached in multiple places (e.g., kprobe tracing
> many different kernel functions and needing to distinguish between
> them at runtime).

In our container environment, we enable CAP_BPF, CAP_PERFMON and
CAP_NET_ADMIN for the containers that want to run BPF programs
inside. However, we don't want them to run whatever BPF programs they
want. We only allow each of them to run the BPF programs we have
permitted for it. So we are using LSM to audit BPF behavior such as
prog load, map creation and link attach. We define different BPF
policies for different containers. In order to identify different
containers efficiently, we assign different bpf_cookies to different
containers. bpf_cookie is a u64, which is enough for our use cases.
We didn't use the cgroup id to identify containers because a
cgroup id is only meaningful locally on one server, while a bpf_cookie
is a value we assign globally, which makes deployment easier.
For your use cases, maybe we could enable CAP_BPF (+CAP_PERFMON,
+CAP_NET_ADMIN) for all users and assign different bpf_cookies to
different users, so that we can use LSM to allow only the users
who have permitted cookies to run BPF programs?
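
The scheme described above can be sketched as a lookup table keyed by
cookie. This is only a userspace model under stated assumptions: the thread
does not show the actual LSM implementation, and the operation flags, struct
and function names below are hypothetical. Each container's cookie maps to
the set of BPF operations it may perform, and unknown cookies are denied.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation flags, mirroring the three audited actions
 * mentioned above: prog load, map creation and link attach. */
enum bpf_op {
    OP_PROG_LOAD   = 1 << 0,
    OP_MAP_CREATE  = 1 << 1,
    OP_LINK_ATTACH = 1 << 2,
};

/* One policy entry per container, identified by its assigned cookie. */
struct cookie_policy {
    uint64_t cookie;
    unsigned int allowed_ops; /* bitmask of enum bpf_op values */
};

/* Return true only if the cookie is known and the operation is in its
 * allowed set. Anything not listed is denied by default. */
static bool policy_allows(const struct cookie_policy *tbl, size_t n,
                          uint64_t cookie, enum bpf_op op)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].cookie == cookie)
            return (tbl[i].allowed_ops & (unsigned int)op) != 0;
    }
    return false;
}
```

A linear scan keeps the sketch short; a real policy with many containers
would more likely use a hash map keyed by the cookie.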

>
> I do agree BPF cookie is super useful and we should keep extending
> other types of BPF programs with BPF cookie support, of course. It's
> just completely orthogonal to BPF token discussion.
>
Andrii Nakryiko July 6, 2023, 8:34 p.m. UTC | #13
On Wed, Jul 5, 2023 at 6:27 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Thu, Jul 6, 2023 at 4:37 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Jun 30, 2023 at 7:06 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Thu, Jun 29, 2023 at 1:18 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > >
> > > > This patch set introduces new BPF object, BPF token, which allows to delegate
> > > > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > > > systemd or any other container manager) to a *trusted* unprivileged
> > > > application. Trust is the key here. This functionality is not about allowing
> > > > unconditional unprivileged BPF usage. Establishing trust, though, is
> > > > completely up to the discretion of respective privileged application that
> > > > would create a BPF token, as different production setups can and do achieve it
> > > > through a combination of different means (signing, LSM, code reviews, etc),
> > > > and it's undesirable and infeasible for kernel to enforce any particular way
> > > > of validating trustworthiness of particular process.
> > > >
> > > > The main motivation for BPF token is a desire to enable containerized
> > > > BPF applications to be used together with user namespaces. This is currently
> > > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > > > arbitrary memory, and it's impossible to ensure that they only read memory of
> > > > processes belonging to any given namespace. This means that it's impossible to
> > > > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > > > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > > > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > > > no assumption about what "trusted" constitutes in any particular case, and
> > > > it's up to specific privileged applications and their surrounding
> > > > infrastructure to decide that. What kernel provides is a set of APIs to create
> > > > and tune BPF token, and pass it around to privileged BPF commands that are
> > > > creating new BPF objects like BPF programs, BPF maps, etc.
> > > >
> > > > Previous attempt at addressing this very same problem ([0]) attempted to
> > > > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > > > LSM maintainers. BPF token concept is not changing anything about LSM
> > > > approach, but can be combined with LSM hooks for very fine-grained security
> > > > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > > > (context), which in combination with BPF LSM would allow implementing a very
> > > > dynamic and fine-granular custom security policies on top of BPF token. In the
> > > > interest of minimizing API surface area discussions this is going to be
> > > > added in follow up patches, as it's not essential to the fundamental concept
> > > > of delegatable BPF token.
> > > >
> > > > It should be noted that BPF token is conceptually quite similar to the idea of
> > > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > > > difference is the idea of using virtual anon_inode file to hold BPF token and
> > > > allowing multiple independent instances of them, each with its own set of
> > > > restrictions. BPF pinning solves the problem of exposing such BPF token
> > > > through file system (BPF FS, in this case) for cases where transferring FDs
> > > > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > > > approach is not using any special stateful task-scoped flags. Instead, bpf()
> > > > syscall accepts token_fd parameters explicitly for each relevant BPF command.
> > > > This addresses main concerns brought up during the /dev/bpf discussion, and
> > > > fits better with overall BPF subsystem design.
> > > >
> > > > This patch set adds a basic minimum of functionality to make BPF token useful
> > > > and to discuss API and functionality. Currently only low-level libbpf APIs
> > > > support passing BPF token around, allowing to test kernel functionality, but
> > > > for the most part is not sufficient for real-world applications, which
> > > > typically use high-level libbpf APIs based on `struct bpf_object` type. This
> > > > was done with the intent to limit the size of patch set and concentrate on
> > > > mostly kernel-side changes. All the necessary plumbing for libbpf will be sent
> > > > as a separate follow up patch set kernel support makes it upstream.
> > > >
> > > > Another part that should happen once kernel-side BPF token is established, is
> > > > a set of conventions between applications (e.g., systemd), tools (e.g.,
> > > > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS
> > > > at well-defined locations to allow applications to take advantage of this
> > > > automatically, without explicit code changes on BPF application's side.
> > > > But I'd like to postpone this discussion to after BPF token concept lands.
> > > >
> > > > One important distinction from v2 that should be noted is a change in the
> > > > semantics of a newly added BPF_TOKEN_CREATE command. Previously,
> > > > BPF_TOKEN_CREATE would create BPF token kernel object and return its FD to
> > > > user-space, allowing it to be (optionally) pinned in BPF FS using BPF_OBJ_PIN
> > > > command. This v3 version changes this slightly: BPF_TOKEN_CREATE combines BPF
> > > > token object creation *and* pinning in BPF FS. Such change ensures that BPF
> > > > token is always associated with a specific instance of BPF FS and cannot
> > > > "escape" it by application re-pinning it somewhere else using another
> > > > BPF_OBJ_PIN call. Now, BPF token can only be pinned once during its creation,
> > > > better containing it inside intended container (under assumption BPF FS is set
> > > > up in such a way as to not be shared with other containers on the system).
> > > >
> > > >   [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
> > > >   [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
> > > >   [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
> > > >
> > > > v3->v3-resend:
> > > >   - I started integrating token_fd into bpf_object_open_opts and higher-level
> > > >     libbpf bpf_object APIs, but it started going a bit deeper into bpf_object
> > > >     implementation details and how libbpf performs feature detection and
> > > >     caching, so I decided to keep it separate from this patch set and not
> > > >     distract from the mostly kernel-side changes;
> > > > v2->v3:
> > > >   - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow
> > > >     BPF_OBJ_PIN for BPF token;
> > > > v1->v2:
> > > >   - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
> > > >   - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).
> > > >
> > > > Andrii Nakryiko (14):
> > > >   bpf: introduce BPF token object
> > > >   libbpf: add bpf_token_create() API
> > > >   selftests/bpf: add BPF_TOKEN_CREATE test
> > > >   bpf: add BPF token support to BPF_MAP_CREATE command
> > > >   libbpf: add BPF token support to bpf_map_create() API
> > > >   selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command
> > > >   bpf: add BPF token support to BPF_BTF_LOAD command
> > > >   libbpf: add BPF token support to bpf_btf_load() API
> > > >   selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest
> > > >   bpf: add BPF token support to BPF_PROG_LOAD command
> > > >   bpf: take into account BPF token when fetching helper protos
> > > >   bpf: consistenly use BPF token throughout BPF verifier logic
> > > >   libbpf: add BPF token support to bpf_prog_load() API
> > > >   selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests
> > > >
> > > >  drivers/media/rc/bpf-lirc.c                   |   2 +-
> > > >  include/linux/bpf.h                           |  79 ++++-
> > > >  include/linux/filter.h                        |   2 +-
> > > >  include/uapi/linux/bpf.h                      |  53 ++++
> > > >  kernel/bpf/Makefile                           |   2 +-
> > > >  kernel/bpf/arraymap.c                         |   2 +-
> > > >  kernel/bpf/cgroup.c                           |   6 +-
> > > >  kernel/bpf/core.c                             |   3 +-
> > > >  kernel/bpf/helpers.c                          |   6 +-
> > > >  kernel/bpf/inode.c                            |  46 ++-
> > > >  kernel/bpf/syscall.c                          | 183 +++++++++---
> > > >  kernel/bpf/token.c                            | 201 +++++++++++++
> > > >  kernel/bpf/verifier.c                         |  13 +-
> > > >  kernel/trace/bpf_trace.c                      |   2 +-
> > > >  net/core/filter.c                             |  36 +--
> > > >  net/ipv4/bpf_tcp_ca.c                         |   2 +-
> > > >  net/netfilter/nf_bpf_link.c                   |   2 +-
> > > >  tools/include/uapi/linux/bpf.h                |  53 ++++
> > > >  tools/lib/bpf/bpf.c                           |  35 ++-
> > > >  tools/lib/bpf/bpf.h                           |  45 ++-
> > > >  tools/lib/bpf/libbpf.map                      |   1 +
> > > >  .../selftests/bpf/prog_tests/libbpf_probes.c  |   4 +
> > > >  .../selftests/bpf/prog_tests/libbpf_str.c     |   6 +
> > > >  .../testing/selftests/bpf/prog_tests/token.c  | 277 ++++++++++++++++++
> > > >  24 files changed, 957 insertions(+), 104 deletions(-)
> > > >  create mode 100644 kernel/bpf/token.c
> > > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c
> > > >
> > > > --
> > > > 2.34.1
> > > >
> > > >
> > >
> > >
> > > Hi Andrii,
> > >
> > > Thanks for your proposal.
> > > That seems to be a useful functionality, and yet I have some questions.
> >
> > I've answered them below. But I don't think either of them have any
> > relation to BPF token and the problem I'm trying to solve.
> >
> > >
> > > 1. Why can't we add security_bpf_probe_read_{kernel,user}?
> > >     If possible, we can use these LSM hooks to refuse the process to
> > > read other tasks' information. E.g. if the other process is not within
> > > the same cgroup or the same namespace, we just refuse the reading. I
> > > think it is not hard to identify if the other process is within the
> > > same cgroup or the same namespace.
> >
> > There are probably many reasons. First, performance-wise, an LSM hook for
> > each bpf_probe_read_{kernel,user}() call will be prohibitive. And just
> > in general, one would need to be very careful with such LSM hooks,
> > because bpf_probe_read_{kernel,user}() often happens from NMI context,
> > and LSM policy would have to be written and validated very carefully
> > with NMI context in mind.
> >
> > But, more conceptually, for probe_read you get a random address and
> > you know the process context you are running in (but you might
> > actually be running in softirq or NMI, where that process context is
> > irrelevant). How can you efficiently (or at all) tell if that random
> > address "belongs" to a cgroup or namespace? Just at the conceptual level?
> >
> > >
> > > 2. Why can't we extend bpf_cookie?
> > >    We're now using bpf_cookie to identify each user or each
> > > application, and only the permitted cookies can create new probe
> > > links.  However we find the bpf_cookie is only supported by tracing,
> > > perf_event and kprobe_multi, so we're planning to extend it to other
> > > possible link types, then we can use LSM hooks to control all bpf
> > > links.  I think that the upstream kernel should also support
> > > bpf_cookie for all bpf links. If possible, we will post it to the
> > > upstream in the future.
> > >    After I have read your BPF token proposal, I just have some other
> > > ideas. Why can't we just extend bpf_cookie to all other BPF objects?
> > > For example, all progs and maps should also have the bpf_cookie.
> > >
> >
> > I'm not exactly clear how you use BPF cookie, but it wasn't intended
> > to provide any sort of security or validation policy. It's purely a
> > user-provided u64 to help distinguish different attach points when the
> > same BPF program is attached in multiple places (e.g., kprobe tracing
> > many different kernel functions and needing to distinguish between
> > them at runtime).
>
> In our container environment, we enable CAP_BPF, CAP_PERFMON and
> CAP_NET_ADMIN for the containers which want to run BPF programs
> inside. However we don't want them to run whatever BPF programs they
> want. We only allow them to run the BPF programs we have permitted for
> each of them.  So we are using LSM to audit the BPF behavior such as
> prog load, map creation and link attach.  We define different BPF
> policies for different containers. In order to identify different
> containers efficiently, we assign different bpf_cookies for different
> containers. bpf_cookie is a u64, that's enough for our use cases.

I can see how you can use BPF cookies for this, but it's certainly not
an intended use case :) BPF cookie is most useful on the BPF side of
things.
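
To make the intended use concrete, here is a minimal user-space sketch in C
of the idea: one handler shared by several attach points, told apart only by
a user-provided u64. This is a model, not kernel or libbpf code; the struct
and function names (attach_point, fire, handler, hits_for) are made up for
illustration. In a real BPF program the cookie would be read with the
bpf_get_attach_cookie() helper instead of being passed as an argument.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one "program" attached at several points.
 * Nothing here is kernel or libbpf API. */
struct attach_point {
	const char *target; /* e.g. name of the traced kernel function */
	uint64_t cookie;    /* user-provided u64 chosen at attach time */
};

#define MAX_COOKIE 8
static uint64_t hits[MAX_COOKIE];

/* The single shared handler: the code is identical for every attach
 * point, so the cookie is the only way to tell invocations apart. */
static void handler(uint64_t cookie)
{
	assert(cookie < MAX_COOKIE);
	hits[cookie]++;
}

/* Simulate the attach point firing; the "runtime" hands the cookie
 * that was associated with this attachment to the handler. */
static void fire(const struct attach_point *ap)
{
	handler(ap->cookie);
}

/* Number of times the handler ran on behalf of a given cookie. */
static uint64_t hits_for(uint64_t cookie)
{
	assert(cookie < MAX_COOKIE);
	return hits[cookie];
}
```

The cookie here is an index into per-attachment state, which is the use it
was designed for; nothing in this model enforces any policy.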

But what you are describing is meant to be doable with BPF token. It's
not in the first patch set, but I intend to allow the user to specify an
extra "user context" blob of bytes which would be stored with the BPF
token. And this data should be accessible from BPF LSM programs to
make extra custom policy decisions. But we need to agree on the initial
BPF token stuff first, and then build out all the rest.
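
As a rough illustration of that direction, the sketch below models a token
that carries an opaque blob and a policy check that decodes it. It is purely
hypothetical user-space C: no such interface exists in this patch set, and
every name in it (struct token, struct container_ctx, token_create,
policy_allows, the CMD_* values) is an assumption made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical command identifiers; not the real BPF_* command values. */
enum cmd { CMD_MAP_CREATE, CMD_PROG_LOAD, CMD_LINK_ATTACH };

/* Hypothetical token: just an opaque "user context" blob. */
struct token {
	uint8_t user_ctx[16];
	size_t  user_ctx_len;
};

/* One possible blob layout, agreed on between the privileged daemon
 * that creates the token and the policy that later inspects it. */
struct container_ctx {
	uint64_t container_id;
	uint32_t allowed_cmds; /* bitmask indexed by enum cmd */
};

/* Privileged side: store the blob with the token at creation time. */
static struct token token_create(const void *ctx, size_t len)
{
	struct token t = {0};

	assert(len <= sizeof(t.user_ctx));
	memcpy(t.user_ctx, ctx, len);
	t.user_ctx_len = len;
	return t;
}

/* Policy side (stand-in for an LSM program): decode the blob and decide.
 * A missing token or an unexpected blob size is denied. */
static bool policy_allows(const struct token *t, enum cmd c)
{
	struct container_ctx cc;

	if (!t || t->user_ctx_len != sizeof(cc))
		return false;
	memcpy(&cc, t->user_ctx, sizeof(cc));
	return (cc.allowed_cmds & (1u << c)) != 0;
}
```

The point of the model is only that the policy input travels with the token
itself rather than with per-link or per-task state.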

> We didn't use cgroup id to identify different containers because
> cgroup id is a local value in a server, while bpf_cookie is a global
> value, that would be easy for deployment.
> For your use cases, maybe we could enable CAP_BPF (+CAP_PERFMON,
> +CAP_NET_ADMIN) for all users, and then assign different
> bpf_cookies to different users, so we can use LSM to allow only users
> who have permitted cookies to run BPF programs?
>
> >
> > I do agree BPF cookie is super useful and we should keep extending
> > other types of BPF programs with BPF cookie support, of course. It's
> > just completely orthogonal to BPF token discussion.
> >
>
> --
> Regards
> Yafang
Yafang Shao July 7, 2023, 1:42 a.m. UTC | #14
On Fri, Jul 7, 2023 at 4:34 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jul 5, 2023 at 6:27 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Thu, Jul 6, 2023 at 4:37 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Fri, Jun 30, 2023 at 7:06 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Thu, Jun 29, 2023 at 1:18 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > >
> > In our container environment, we enable CAP_BPF, CAP_PERFMON and
> > CAP_NET_ADMIN for the containers which want to run BPF programs
> > inside. However we don't want them to run whatever BPF programs they
> > want. We only allow them to run the BPF programs we have permitted for
> > each of them.  So we are using LSM to audit the BPF behavior such as
> > prog load, map creation and link attach.  We define different BPF
> > policies for different containers. In order to identify different
> > containers efficiently, we assign different bpf_cookies for different
> > containers. bpf_cookie is a u64, that's enough for our use cases.
>
> I can see how you can use BPF cookies for this, but it's certainly not
> an intended use case :) BPF cookie is most useful on BPF side of
> things.

Using bpf_cookie as an app ID has proven valuable in our use case, so
we will continue to rely on it :)

>
> But what you are describing is meant to be doable with BPF token. It's
> not in the first patch set, but I intend to allow the user to specify an
> extra "user context" blob of bytes which would be stored with the BPF
> token. And this data should be accessible from BPF LSM programs to
> make extra custom policy decisions. But we need to agree on the initial
> BPF token stuff first, and then build out all the rest.

Sounds good. Supporting user context in the BPF token would make it
considerably more useful for this kind of policy.