Message ID | 20230629051832.897119-1-andrii@kernel.org (mailing list archive) |
---|---|
Series | BPF token |
Andrii Nakryiko <andrii@kernel.org> writes: > This patch set introduces new BPF object, BPF token, which allows to delegate > a subset of BPF functionality from privileged system-wide daemon (e.g., > systemd or any other container manager) to a *trusted* unprivileged > application. Trust is the key here. This functionality is not about allowing > unconditional unprivileged BPF usage. Establishing trust, though, is > completely up to the discretion of respective privileged application that > would create a BPF token, as different production setups can and do achieve it > through a combination of different means (signing, LSM, code reviews, etc), > and it's undesirable and infeasible for kernel to enforce any particular way > of validating trustworthiness of particular process. > > The main motivation for BPF token is a desire to enable containerized > BPF applications to be used together with user namespaces. This is currently > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > arbitrary memory, and it's impossible to ensure that they only read memory of > processes belonging to any given namespace. This means that it's impossible to > have namespace-aware CAP_BPF capability, and as such another mechanism to > allow safe usage of BPF functionality is necessary. BPF token and delegation > of it to a trusted unprivileged applications is such mechanism. Kernel makes > no assumption about what "trusted" constitutes in any particular case, and > it's up to specific privileged applications and their surrounding > infrastructure to decide that. What kernel provides is a set of APIs to create > and tune BPF token, and pass it around to privileged BPF commands that are > creating new BPF objects like BPF programs, BPF maps, etc. 
So a colleague pointed out today that the Seccomp Notify functionality would be a way to achieve your stated goal of allowing unprivileged containers to (selectively) perform bpf() syscall operations. Christian Brauner has a pretty nice writeup of the functionality here:
https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development

In fact he even mentions allowing unprivileged access to bpf() as a possible use case (in the second-to-last paragraph).

AFAICT this would enable your use case without adding any new kernel functionality or changing the BPF-using applications, while allowing the privileged userspace daemon to make case-by-case decisions on each operation instead of granting blanket capabilities (which is my main objection to the token proposal, as we discussed on the last iteration of the series).

So I'm curious whether you considered this as an alternative to BPF_TOKEN? And if so, what your reason was for rejecting it?

-Toke
On Thu, Jun 29, 2023 at 4:15 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Andrii Nakryiko <andrii@kernel.org> writes: > > > This patch set introduces new BPF object, BPF token, which allows to delegate > > a subset of BPF functionality from privileged system-wide daemon (e.g., > > systemd or any other container manager) to a *trusted* unprivileged > > application. Trust is the key here. This functionality is not about allowing > > unconditional unprivileged BPF usage. Establishing trust, though, is > > completely up to the discretion of respective privileged application that > > would create a BPF token, as different production setups can and do achieve it > > through a combination of different means (signing, LSM, code reviews, etc), > > and it's undesirable and infeasible for kernel to enforce any particular way > > of validating trustworthiness of particular process. > > > > The main motivation for BPF token is a desire to enable containerized > > BPF applications to be used together with user namespaces. This is currently > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > > arbitrary memory, and it's impossible to ensure that they only read memory of > > processes belonging to any given namespace. This means that it's impossible to > > have namespace-aware CAP_BPF capability, and as such another mechanism to > > allow safe usage of BPF functionality is necessary. BPF token and delegation > > of it to a trusted unprivileged applications is such mechanism. Kernel makes > > no assumption about what "trusted" constitutes in any particular case, and > > it's up to specific privileged applications and their surrounding > > infrastructure to decide that. 
> > What kernel provides is a set of APIs to create and tune BPF token, and pass it around to privileged BPF commands that are creating new BPF objects like BPF programs, BPF maps, etc.
>
> So a colleague pointed out today that the Seccomp Notify functionality would be a way to achieve your stated goal of allowing unprivileged containers to (selectively) perform bpf() syscall operations. Christian Brauner has a pretty nice writeup of the functionality here:
> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
>
> In fact he even mentions allowing unprivileged access to bpf() as a possible use case (in the second-to-last paragraph).
>
> AFAICT this would enable your use case without adding any new kernel functionality or changing the BPF-using applications, while allowing the privileged userspace daemon to make case-by-case decisions on each operation instead of granting blanket capabilities (which is my main objection to the token proposal, as we discussed on the last iteration of the series).

It's not "blanket" capabilities. You control types of maps and programs that could be created. And again, CAP_SYS_ADMIN guarded. Please, don't give CAP_SYS_ADMIN/root permissions to applications you can't be sure won't do something stupid and blame kernel API for it.

After all, the root process can setuid() any file and make it run with elevated permissions, right? Doesn't get more "blanket" than that.

> So I'm curious whether you considered this as an alternative to BPF_TOKEN? And if so, what your reason was for rejecting it?

Yes, I'm aware, Christian has a follow up short blog post specifically for using this for proxying BPF from privileged process ([0]).

So, in short, I think it's not a good generic solution. It's very fragile and high-maintenance.
It's still proxying BPF UAPI (except application does preserve illusion of using BPF syscall, yes, that part is good) with all the implications: needing to replicate all of UAPI (fetching all those FDs from another process, following all the pointers from another process' memory, etc), and also writing back all the correct things (into another process' memory): log content, log_true_size (out param), any other output parameters. What do we do when an application uses a newer version of bpf_attr than is supported by proxy? And honestly, I'm like 99% sure there are lots of less obvious issues one runs into when starting to implement something like this.

This sounds like a hack and a nightmare to implement and support. Perhaps that indirectly is supported by the fact that even Christian half-jokingly calls this a crazy approach. That code basically is unchanged for the last three years, with only one fix from Christian one year after initial introduction ([1]) to fix a quirky issue related to the limitation of pidfd working only for thread group leaders. It also still supports only BPF_PROG_TYPE_CGROUP_DEVICE program loading, it doesn't support a bunch of newer BPF_PROG_LOAD fields and functionality, etc, etc.

So as a technical curiosity it's pretty cool and perhaps is the right tool for the job for very narrow specific use cases. But as a realistic generic approach that could be used by industry at large for safe BPF usage from namespaced containers -- not so much.

[0] https://brauner.io/2020/08/07/seccomp-notify-intercepting-the-bpf-syscall.html
[1] https://github.com/lxc/lxd/commit/566d0a3b3cbe288787886c2f3bf5b250ceb930b0

> -Toke
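To make the bpf_attr versioning question above concrete, here is a minimal sketch of the kind of forward-compatibility check a userspace proxy would need before acting on a request. The helper name attr_tail_is_zero() is hypothetical and is not taken from LXD or the kernel. The idea, similar in spirit to the trailing-zero rule the kernel applies to oversized bpf_attr structures, is that a proxy that understands only the first known_size bytes can safely handle a larger structure only if every byte past that point is zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper for a userspace bpf() proxy (illustration only).
 *
 * Returns true when an attribute buffer of actual_size bytes can be handled
 * by a proxy that only understands the first known_size bytes, i.e. when all
 * bytes beyond known_size are zero and therefore request no newer feature.
 */
static bool attr_tail_is_zero(const unsigned char *attr,
                              size_t known_size, size_t actual_size)
{
    assert(attr != NULL);

    /* Caller uses the same or an older layout: nothing unknown to inspect. */
    if (actual_size <= known_size)
        return true;

    /* Any nonzero byte in the unknown tail means an unsupported field is set. */
    for (size_t i = known_size; i < actual_size; i++) {
        if (attr[i] != 0)
            return false;
    }
    return true;
}
```

A proxy would call this on the copy of the attributes it read from the target process and reject the request when it returns false. This covers only one of the issues listed above; fetching FDs, following pointers and writing back output parameters all remain.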
On Thu, Jun 29, 2023 at 1:18 PM Andrii Nakryiko <andrii@kernel.org> wrote: > > This patch set introduces new BPF object, BPF token, which allows to delegate > a subset of BPF functionality from privileged system-wide daemon (e.g., > systemd or any other container manager) to a *trusted* unprivileged > application. Trust is the key here. This functionality is not about allowing > unconditional unprivileged BPF usage. Establishing trust, though, is > completely up to the discretion of respective privileged application that > would create a BPF token, as different production setups can and do achieve it > through a combination of different means (signing, LSM, code reviews, etc), > and it's undesirable and infeasible for kernel to enforce any particular way > of validating trustworthiness of particular process. > > The main motivation for BPF token is a desire to enable containerized > BPF applications to be used together with user namespaces. This is currently > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > arbitrary memory, and it's impossible to ensure that they only read memory of > processes belonging to any given namespace. This means that it's impossible to > have namespace-aware CAP_BPF capability, and as such another mechanism to > allow safe usage of BPF functionality is necessary. BPF token and delegation > of it to a trusted unprivileged applications is such mechanism. Kernel makes > no assumption about what "trusted" constitutes in any particular case, and > it's up to specific privileged applications and their surrounding > infrastructure to decide that. What kernel provides is a set of APIs to create > and tune BPF token, and pass it around to privileged BPF commands that are > creating new BPF objects like BPF programs, BPF maps, etc. 
>
> Previous attempt at addressing this very same problem ([0]) attempted to utilize authoritative LSM approach, but was conclusively rejected by upstream LSM maintainers. BPF token concept is not changing anything about LSM approach, but can be combined with LSM hooks for very fine-grained security policy. Some ideas about making BPF token more convenient to use with LSM (in particular custom BPF LSM programs) were briefly described in recent LSF/MM/BPF 2023 presentation ([1]). E.g., an ability to specify user-provided data (context), which in combination with BPF LSM would allow implementing very dynamic and fine-grained custom security policies on top of BPF token. In the interest of minimizing API surface area discussions this is going to be added in follow up patches, as it's not essential to the fundamental concept of delegatable BPF token.
>
> It should be noted that BPF token is conceptually quite similar to the idea of /dev/bpf device file, proposed by Song a while ago ([2]). The biggest difference is the idea of using virtual anon_inode file to hold BPF token and allowing multiple independent instances of them, each with its own set of restrictions. BPF pinning solves the problem of exposing such BPF token through file system (BPF FS, in this case) for cases where transferring FDs over Unix domain sockets is not convenient. And also, crucially, BPF token approach is not using any special stateful task-scoped flags. Instead, bpf() syscall accepts token_fd parameters explicitly for each relevant BPF command. This addresses main concerns brought up during the /dev/bpf discussion, and fits better with overall BPF subsystem design.
>
> This patch set adds a basic minimum of functionality to make BPF token useful and to discuss API and functionality.
> Currently only low-level libbpf APIs support passing BPF token around, allowing to test kernel functionality, but for the most part is not sufficient for real-world applications, which typically use high-level libbpf APIs based on `struct bpf_object` type. This was done with the intent to limit the size of patch set and concentrate on mostly kernel-side changes. All the necessary plumbing for libbpf will be sent as a separate follow up patch set once kernel support makes it upstream.
>
> Another part that should happen once kernel-side BPF token is established, is a set of conventions between applications (e.g., systemd), tools (e.g., bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS at well-defined locations to allow applications to take advantage of this in automatic fashion without explicit code changes on BPF application's side. But I'd like to postpone this discussion to after BPF token concept lands.
>
> One important distinction from v2 that should be noted is a change in the semantics of a newly added BPF_TOKEN_CREATE command. Previously, BPF_TOKEN_CREATE would create BPF token kernel object and return its FD to user-space, allowing to (optionally) pin it in BPF FS using BPF_OBJ_PIN command. This v3 version changes this slightly: BPF_TOKEN_CREATE combines BPF token object creation *and* pinning in BPF FS. Such change ensures that BPF token is always associated with a specific instance of BPF FS and cannot "escape" it by application re-pinning it somewhere else using another BPF_OBJ_PIN call. Now, BPF token can only be pinned once during its creation, better containing it inside intended container (under assumption BPF FS is set up in such a way as to not be shared with other containers on the system).
>
> [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
> [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
> [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
>
> v3->v3-resend:
>   - I started integrating token_fd into bpf_object_open_opts and higher-level
>     libbpf bpf_object APIs, but it started going a bit deeper into bpf_object
>     implementation details and how libbpf performs feature detection and
>     caching, so I decided to keep it separate from this patch set and not
>     distract from the mostly kernel-side changes;
> v2->v3:
>   - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow
>     BPF_OBJ_PIN for BPF token;
> v1->v2:
>   - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
>   - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).
>
> Andrii Nakryiko (14):
>   bpf: introduce BPF token object
>   libbpf: add bpf_token_create() API
>   selftests/bpf: add BPF_TOKEN_CREATE test
>   bpf: add BPF token support to BPF_MAP_CREATE command
>   libbpf: add BPF token support to bpf_map_create() API
>   selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command
>   bpf: add BPF token support to BPF_BTF_LOAD command
>   libbpf: add BPF token support to bpf_btf_load() API
>   selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest
>   bpf: add BPF token support to BPF_PROG_LOAD command
>   bpf: take into account BPF token when fetching helper protos
>   bpf: consistenly use BPF token throughout BPF verifier logic
>   libbpf: add BPF token support to bpf_prog_load() API
>   selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests
>
>  drivers/media/rc/bpf-lirc.c                  |   2 +-
>  include/linux/bpf.h                          |  79 ++++-
>  include/linux/filter.h                       |   2 +-
>  include/uapi/linux/bpf.h                     |  53 ++++
>  kernel/bpf/Makefile                          |   2 +-
>  kernel/bpf/arraymap.c                        |   2 +-
>  kernel/bpf/cgroup.c                          |   6 +-
>  kernel/bpf/core.c                            |   3 +-
>  kernel/bpf/helpers.c                         |   6 +-
>  kernel/bpf/inode.c                           |  46 ++-
>  kernel/bpf/syscall.c                         | 183 +++++++++---
>  kernel/bpf/token.c                           | 201 +++++++++++++
>  kernel/bpf/verifier.c                        |  13 +-
>  kernel/trace/bpf_trace.c                     |   2 +-
>  net/core/filter.c                            |  36 +--
>  net/ipv4/bpf_tcp_ca.c                        |   2 +-
>  net/netfilter/nf_bpf_link.c                  |   2 +-
>  tools/include/uapi/linux/bpf.h               |  53 ++++
>  tools/lib/bpf/bpf.c                          |  35 ++-
>  tools/lib/bpf/bpf.h                          |  45 ++-
>  tools/lib/bpf/libbpf.map                     |   1 +
>  .../selftests/bpf/prog_tests/libbpf_probes.c |   4 +
>  .../selftests/bpf/prog_tests/libbpf_str.c    |   6 +
>  .../testing/selftests/bpf/prog_tests/token.c | 277 ++++++++++++++++++
>  24 files changed, 957 insertions(+), 104 deletions(-)
>  create mode 100644 kernel/bpf/token.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c
>
> --
> 2.34.1
>

Hi Andrii,

Thanks for your proposal. That seems to be a useful functionality, and yet I have some questions.

1. Why can't we add security_bpf_probe_read_{kernel,user}?
   If possible, we can use these LSM hooks to refuse the process to read other tasks' information. E.g. if the other process is not within the same cgroup or the same namespace, we just refuse the reading. I think it is not hard to identify if the other process is within the same cgroup or the same namespace.

2. Why can't we extend bpf_cookie?
   We're now using bpf_cookie to identify each user or each application, and only the permitted cookies can create new probe links. However we find the bpf_cookie is only supported by tracing, perf_event and kprobe_multi, so we're planning to extend it to other possible link types, then we can use LSM hooks to control all bpf links. I think that the upstream kernel should also support bpf_cookie for all bpf links. If possible, we will post it to the upstream in the future.
   After I have read your BPF token proposal, I just have some other ideas. Why can't we just extend bpf_cookie to all other BPF objects? For example, all progs and maps should also have the bpf_cookie.
On Fri, Jun 30, 2023 at 1:16 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Andrii Nakryiko <andrii@kernel.org> writes: > > > This patch set introduces new BPF object, BPF token, which allows to delegate > > a subset of BPF functionality from privileged system-wide daemon (e.g., > > systemd or any other container manager) to a *trusted* unprivileged > > application. Trust is the key here. This functionality is not about allowing > > unconditional unprivileged BPF usage. Establishing trust, though, is > > completely up to the discretion of respective privileged application that > > would create a BPF token, as different production setups can and do achieve it > > through a combination of different means (signing, LSM, code reviews, etc), > > and it's undesirable and infeasible for kernel to enforce any particular way > > of validating trustworthiness of particular process. > > > > The main motivation for BPF token is a desire to enable containerized > > BPF applications to be used together with user namespaces. This is currently > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > > arbitrary memory, and it's impossible to ensure that they only read memory of > > processes belonging to any given namespace. This means that it's impossible to > > have namespace-aware CAP_BPF capability, and as such another mechanism to > > allow safe usage of BPF functionality is necessary. BPF token and delegation > > of it to a trusted unprivileged applications is such mechanism. Kernel makes > > no assumption about what "trusted" constitutes in any particular case, and > > it's up to specific privileged applications and their surrounding > > infrastructure to decide that. 
> > What kernel provides is a set of APIs to create and tune BPF token, and pass it around to privileged BPF commands that are creating new BPF objects like BPF programs, BPF maps, etc.
>
> So a colleague pointed out today that the Seccomp Notify functionality would be a way to achieve your stated goal of allowing unprivileged containers to (selectively) perform bpf() syscall operations. Christian Brauner has a pretty nice writeup of the functionality here:
> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development
>
> In fact he even mentions allowing unprivileged access to bpf() as a possible use case (in the second-to-last paragraph).
>
> AFAICT this would enable your use case without adding any new kernel functionality or changing the BPF-using applications, while allowing the privileged userspace daemon to make case-by-case decisions on each operation instead of granting blanket capabilities (which is my main objection to the token proposal, as we discussed on the last iteration of the series).
>
> So I'm curious whether you considered this as an alternative to BPF_TOKEN? And if so, what your reason was for rejecting it?

The Seccomp notifier:

1. is an answer to special device nodes (or arguably to simple cases...),
2. is a quick solution that avoids changing infrastructure and how the kernel deals with device nodes (it doesn't solve the root problem, which this BPF series at least tries to...),
3. relies on Seccomp and would inherit the same limitations.

It clashes with BPF! BPF is not mknod, and most of its use cases are *transparent to the workload*; they can't use Seccomp and are not interested in it...

Fd delegation is good design and applies to *all* BPF use cases. All tools can take advantage of it; it is not restricted to a special tool or daemon X. Going further, hiding behind the Seccomp notifier and such prevents BPF from solving current and future problems.
On Fri, Jun 30, 2023 at 11:25:57AM -0700, Andrii Nakryiko wrote: > On Thu, Jun 29, 2023 at 4:15 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > > > Andrii Nakryiko <andrii@kernel.org> writes: > > > > > This patch set introduces new BPF object, BPF token, which allows to delegate > > > a subset of BPF functionality from privileged system-wide daemon (e.g., > > > systemd or any other container manager) to a *trusted* unprivileged > > > application. Trust is the key here. This functionality is not about allowing > > > unconditional unprivileged BPF usage. Establishing trust, though, is > > > completely up to the discretion of respective privileged application that > > > would create a BPF token, as different production setups can and do achieve it > > > through a combination of different means (signing, LSM, code reviews, etc), > > > and it's undesirable and infeasible for kernel to enforce any particular way > > > of validating trustworthiness of particular process. > > > > > > The main motivation for BPF token is a desire to enable containerized > > > BPF applications to be used together with user namespaces. This is currently > > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > > > arbitrary memory, and it's impossible to ensure that they only read memory of > > > processes belonging to any given namespace. This means that it's impossible to > > > have namespace-aware CAP_BPF capability, and as such another mechanism to > > > allow safe usage of BPF functionality is necessary. BPF token and delegation > > > of it to a trusted unprivileged applications is such mechanism. 
Kernel makes > > > no assumption about what "trusted" constitutes in any particular case, and > > > it's up to specific privileged applications and their surrounding > > > infrastructure to decide that. What kernel provides is a set of APIs to create > > > and tune BPF token, and pass it around to privileged BPF commands that are > > > creating new BPF objects like BPF programs, BPF maps, etc. > > > > So a colleague pointed out today that the Seccomp Notify functionality > > would be a way to achieve your stated goal of allowing unprivileged > > containers to (selectively) perform bpf() syscall operations. Christian > > Brauner has a pretty nice writeup of the functionality here: > > https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development > > > > In fact he even mentions allowing unprivileged access to bpf() as a > > possible use case (in the second-to-last paragraph). > > > > AFAICT this would enable your use case without adding any new kernel > > functionality or changing the BPF-using applications, while allowing the > > privileged userspace daemon to make case-by-case decisions on each > > operation instead of granting blanket capabilities (which is my main > > objection to the token proposal, as we discussed on the last iteration > > of the series). > > It's not "blanket" capabilities. You control types or maps and > programs that could be created. And again, CAP_SYS_ADMIN guarded. > Please, don't give CAP_SYS_ADMIN/root permissions to applications you > can't be sure won't do something stupid and blame kernel API for it. > > After all, the root process can setuid() any file and make it run with > elevated permissions, right? Doesn't get more "blanket" than that. > > > > > So I'm curious whether you considered this as an alternative to > > BPF_TOKEN? And if so, what your reason was for rejecting it? 
> > > > Yes, I'm aware, Christian has a follow up short blog post specifically > for using this for proxying BPF from privileged process ([0]). > > So, in short, I think it's not a good generic solution. It's very > fragile and high-maintenance. It's still proxying BPF UAPI (except > application does preserve illusion of using BPF syscall, yes, that > part is good) with all the implications: needing to replicate all of > UAPI (fetching all those FDs from another process, following all the > pointers from another process' memory, etc), and also writing back all > the correct things (into another process' memory): log content, > log_true_size (out param), any other output parameters. What do we do > when an application uses a newer version of bpf_attr that is supported > by proxy? And honestly, I'm like 99% sure there are lots of less > obvious issues one runs into when starting implementing something like > this. > > This sounds like a hack and nightmare to implement and support. > Perhaps that indirectly is supported by the fact that even Christian > half-jokingly calls this a crazy approach. That code basically is > unchanged for the last three years, with only one fix from Christian > one year after initial introduction ([1]) to fix a quirky issue > related to the limitation of pidfd working only for thread group > leaders. It also still supports only BPF_PROG_TYPE_CGROUP_DEVICE > program loading, it doesn't support a bunch of newer BPF_PROG_LOAD > fields and functionality, etc, etc. > > So as a technical curiosity it's pretty cool and perhaps is the right > tool for the job for very narrow specific use cases. But as a > realistic generic approach that could be used by industry at large for > safe BPF usage from namespaced containers -- not so much. Some background... When BPF & cgroup moved the devices cgroup from a file-based cgroup controller into a BPF program it was technically an immediate widespread regression. 
The cgroup v1 controller was file based and supported seamlessly switching between allow- and denylists. Whether that was ever sensible is a separate question. But what this meant was that any container runtime that used a simple file-based mechanism now had to generate a BPF device program that mirrored the cgroup v1 semantic such that the old syntax of the cgroup v1 device controller would be correctly translated into a BPF devices program. In addition, this broke some nesting scenarios.

So intercepting bpf() via seccomp was specifically done to avoid devices cgroup regressions. It was never meant to be a generic solution. It also doesn't work for all cases as the seccomp notifier's supervision mechanism isn't really a clean solution.

It's a pipe dream that you can transparently proxy system calls for another process via seccomp for sufficiently complex system calls. We did it for specific use-cases where we could sufficiently guarantee that they could be safe. But to make this work it would involve way more invasive changes:

* nesting/stacking of seccomp notifiers
* clean handling of pointer arguments in-kernel such that you can safely continue system calls being sure that they haven't been modified. This is currently only possible in scenarios where safety is guaranteed by the kernel refusing nonsensical or unsafe arguments
* correct privilege handling

  The seccomp notifier emulates system calls in userspace and thus has to mimic the privilege context of the task it is emulating the system call for in such a way that (i) it allows it to succeed by avoiding the privilege limitations of why the given system call was supposed to be proxied in the first place, (ii) it doesn't allow circumventing other, generic restrictions that would otherwise cause the system call to fail.

  It's like saying e.g., "execute with most of the proxied task's creds but let it have a few more privileges". That's frail as Linux creds aren't really composable. That's why we have override_creds() not "add_creds()" and "subtract_creds()" which would probably be nicer.

Or it would have to be a generic first class kernel proxy which begs the question why not change the subsystems themselves to do this cleanly.
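For readers unfamiliar with the mechanism under discussion, the following is a minimal sketch of the classic-BPF seccomp filter a supervisor could use to route bpf() calls to a userspace notifier. It only builds the filter and does not install it. The names notify_on_bpf and bpf_notify_filter() are made up for this illustration, and the filter deliberately omits the seccomp_data.arch check that a real filter should perform first. It assumes Linux UAPI headers recent enough to define SECCOMP_RET_USER_NOTIF.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* Illustrative filter: notify the supervisor on bpf(), allow everything else. */
static const struct sock_filter notify_on_bpf[] = {
    /* Load the syscall number from struct seccomp_data. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* If it is bpf(), fall through to the next instruction; else skip one. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_bpf, 0, 1),
    /* bpf(): suspend the caller and hand the call to the userspace notifier. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
    /* Any other syscall: let it run normally. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Hypothetical accessor returning the filter and its instruction count. */
static const struct sock_filter *bpf_notify_filter(size_t *len)
{
    assert(len != NULL);
    *len = sizeof(notify_on_bpf) / sizeof(notify_on_bpf[0]);
    return notify_on_bpf;
}
```

Installing such a filter with a notification listener, and then emulating each intercepted bpf() call on behalf of the caller, is where the pointer-argument and credential problems described above come in; none of that is shown here.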
On Fri, Jun 30, 2023 at 01:15:47AM +0200, Toke Høiland-Jørgensen wrote: > Andrii Nakryiko <andrii@kernel.org> writes: > > > This patch set introduces new BPF object, BPF token, which allows to delegate > > a subset of BPF functionality from privileged system-wide daemon (e.g., > > systemd or any other container manager) to a *trusted* unprivileged > > application. Trust is the key here. This functionality is not about allowing > > unconditional unprivileged BPF usage. Establishing trust, though, is > > completely up to the discretion of respective privileged application that > > would create a BPF token, as different production setups can and do achieve it > > through a combination of different means (signing, LSM, code reviews, etc), > > and it's undesirable and infeasible for kernel to enforce any particular way > > of validating trustworthiness of particular process. > > > > The main motivation for BPF token is a desire to enable containerized > > BPF applications to be used together with user namespaces. This is currently > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > > arbitrary memory, and it's impossible to ensure that they only read memory of > > processes belonging to any given namespace. This means that it's impossible to > > have namespace-aware CAP_BPF capability, and as such another mechanism to > > allow safe usage of BPF functionality is necessary. BPF token and delegation > > of it to a trusted unprivileged applications is such mechanism. Kernel makes > > no assumption about what "trusted" constitutes in any particular case, and > > it's up to specific privileged applications and their surrounding > > infrastructure to decide that. 
> > What kernel provides is a set of APIs to create and tune BPF token, and pass it around to privileged BPF commands that are creating new BPF objects like BPF programs, BPF maps, etc.
>
> So a colleague pointed out today that the Seccomp Notify functionality would be a way to achieve your stated goal of allowing unprivileged containers to (selectively) perform bpf() syscall operations. Christian Brauner has a pretty nice writeup of the functionality here:
> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development

I'm amazed you read this. :)

The seccomp notifier comes with a lot of caveats. I think it would be impractical if not infeasible to handle bpf() delegation.

> In fact he even mentions allowing unprivileged access to bpf() as a possible use case (in the second-to-last paragraph).

Yeah, I tried to work around a userspace regression with the introduction of the cgroup v2 devices controller.
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > On Thu, Jun 29, 2023 at 4:15 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> Andrii Nakryiko <andrii@kernel.org> writes: >> >> > This patch set introduces new BPF object, BPF token, which allows to delegate >> > a subset of BPF functionality from privileged system-wide daemon (e.g., >> > systemd or any other container manager) to a *trusted* unprivileged >> > application. Trust is the key here. This functionality is not about allowing >> > unconditional unprivileged BPF usage. Establishing trust, though, is >> > completely up to the discretion of respective privileged application that >> > would create a BPF token, as different production setups can and do achieve it >> > through a combination of different means (signing, LSM, code reviews, etc), >> > and it's undesirable and infeasible for kernel to enforce any particular way >> > of validating trustworthiness of particular process. >> > >> > The main motivation for BPF token is a desire to enable containerized >> > BPF applications to be used together with user namespaces. This is currently >> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced >> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF >> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read >> > arbitrary memory, and it's impossible to ensure that they only read memory of >> > processes belonging to any given namespace. This means that it's impossible to >> > have namespace-aware CAP_BPF capability, and as such another mechanism to >> > allow safe usage of BPF functionality is necessary. BPF token and delegation >> > of it to a trusted unprivileged applications is such mechanism. Kernel makes >> > no assumption about what "trusted" constitutes in any particular case, and >> > it's up to specific privileged applications and their surrounding >> > infrastructure to decide that. 
What kernel provides is a set of APIs to create >> > and tune BPF token, and pass it around to privileged BPF commands that are >> > creating new BPF objects like BPF programs, BPF maps, etc. >> >> So a colleague pointed out today that the Seccomp Notify functionality >> would be a way to achieve your stated goal of allowing unprivileged >> containers to (selectively) perform bpf() syscall operations. Christian >> Brauner has a pretty nice writeup of the functionality here: >> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development >> >> In fact he even mentions allowing unprivileged access to bpf() as a >> possible use case (in the second-to-last paragraph). >> >> AFAICT this would enable your use case without adding any new kernel >> functionality or changing the BPF-using applications, while allowing the >> privileged userspace daemon to make case-by-case decisions on each >> operation instead of granting blanket capabilities (which is my main >> objection to the token proposal, as we discussed on the last iteration >> of the series). > > It's not "blanket" capabilities. You control types or maps and > programs that could be created. And again, CAP_SYS_ADMIN guarded. > Please, don't give CAP_SYS_ADMIN/root permissions to applications you > can't be sure won't do something stupid and blame kernel API for it. Right, I didn't mean "blanket" in the sense of "permission to do anything on the system"; I do get that you can restrict which subset of functionality you grant. However, *within* that subset, it's a blanket permission grant. I.e., you can't issue a token that grants a *specific* application permission to load a *specific* BPF program - you can only grant a general "load any program" permission that can be used by anyone who possesses the token. 
I guess we could in principle extend the token mechanism to allow this, but the kernel doesn't seem like the right place to implement such a fine-grained policy engine... > After all, the root process can setuid() any file and make it run with > elevated permissions, right? Doesn't get more "blanket" than that. Which is exactly why setuid binaries are not generally how we implement security delegation these days. So I don't think designing a new mechanism this way is a good idea. >> So I'm curious whether you considered this as an alternative to >> BPF_TOKEN? And if so, what your reason was for rejecting it? >> > > Yes, I'm aware, Christian has a follow up short blog post specifically > for using this for proxying BPF from privileged process ([0]). > > So, in short, I think it's not a good generic solution. It's very > fragile and high-maintenance. It's still proxying BPF UAPI (except > application does preserve illusion of using BPF syscall, yes, that > part is good) with all the implications: needing to replicate all of > UAPI (fetching all those FDs from another process, following all the > pointers from another process' memory, etc), and also writing back all > the correct things (into another process' memory): log content, > log_true_size (out param), any other output parameters. Right, OK, that bit does sound pretty tedious (although I'll note that there are people who are trying to make all this generally more palatable[0]). However, all that tediousness could be avoided while still retaining the model of blocking the syscall and asking a userspace policy daemon to supply a verdict. This could even be done using the same token mechanism: instead of attaching a permission to the token itself, just make it an opaque identifier. Then, when a syscall is made that contains the token, block it and send a notification to user space and use the verdict that comes back in place of the token "value". 
The notification could go through the same file descriptor (using read/write or an ioctl, restricted to CAP_SYS_ADMIN), or it could be a separate one that is returned alongside it on TOKEN_CREATE. The notification could include all of the syscall args or a subset, depending on the command, but the kernel can ensure there are no TOCTOU races, and there is no need for the policy daemon to go poking into another process' namespace.

Actually, using this model I don't think we would, strictly speaking, even need the calling application inside the container to include the explicit token FD at all. I.e., if the system policy daemon could just instruct the kernel "please delegate all permission decisions for this user namespace to me", it could - so to speak - issue tokens on demand as each call is made, instead of ahead of time. This would both enable the policy daemon to make specific usage decisions and require no changes to the applications using BPF inside the container (not even to include the BPF token FD).

-Toke
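The difference between the two delegation models argued over in this message can be sketched in a few lines. This is only an illustrative userspace model, not kernel code or any proposed UAPI: `BpfRequest`, `SubsetToken`, `make_daemon_verdict` and the `prog_id` field are hypothetical names invented for the sketch; only the command and program type strings mirror real BPF constants.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class BpfRequest:
    """Hypothetical summary of one bpf() call, as a policy daemon might see it."""
    command: str      # e.g. "BPF_PROG_LOAD" or "BPF_MAP_CREATE"
    prog_type: str    # e.g. "BPF_PROG_TYPE_KPROBE"; empty if not a program load
    prog_id: str      # hypothetical identifier (say, a digest) of the program


@dataclass(frozen=True)
class SubsetToken:
    """Model of a token whose permissions travel with it: any holder may use them."""
    commands: FrozenSet[str]
    prog_types: FrozenSet[str]

    def permits(self, req: BpfRequest) -> bool:
        if req.command not in self.commands:
            return False
        if req.command == "BPF_PROG_LOAD":
            return req.prog_type in self.prog_types
        return True


def make_daemon_verdict(approved_prog_ids: FrozenSet[str]) -> Callable[[BpfRequest], bool]:
    """Model of a per-call verdict: the daemon sees each request and decides."""
    def verdict(req: BpfRequest) -> bool:
        return req.command == "BPF_PROG_LOAD" and req.prog_id in approved_prog_ids
    return verdict


token = SubsetToken(frozenset({"BPF_PROG_LOAD"}), frozenset({"BPF_PROG_TYPE_KPROBE"}))
verdict = make_daemon_verdict(frozenset({"reviewed-tracer"}))

reviewed = BpfRequest("BPF_PROG_LOAD", "BPF_PROG_TYPE_KPROBE", "reviewed-tracer")
unreviewed = BpfRequest("BPF_PROG_LOAD", "BPF_PROG_TYPE_KPROBE", "something-else")

# The subset token cannot tell the two programs apart; the per-call verdict can.
print(token.permits(reviewed), token.permits(unreviewed))
print(verdict(reviewed), verdict(unreviewed))
```

In this model the token accepts both kprobe programs because it only knows about types, while the verdict callback can single out one specific program. Whether that finer granularity belongs in the kernel, in an LSM, or in a userspace daemon is exactly the open question in the thread.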
Christian Brauner <brauner@kernel.org> writes: > On Fri, Jun 30, 2023 at 01:15:47AM +0200, Toke Høiland-Jørgensen wrote: >> Andrii Nakryiko <andrii@kernel.org> writes: >> >> > This patch set introduces new BPF object, BPF token, which allows to delegate >> > a subset of BPF functionality from privileged system-wide daemon (e.g., >> > systemd or any other container manager) to a *trusted* unprivileged >> > application. Trust is the key here. This functionality is not about allowing >> > unconditional unprivileged BPF usage. Establishing trust, though, is >> > completely up to the discretion of respective privileged application that >> > would create a BPF token, as different production setups can and do achieve it >> > through a combination of different means (signing, LSM, code reviews, etc), >> > and it's undesirable and infeasible for kernel to enforce any particular way >> > of validating trustworthiness of particular process. >> > >> > The main motivation for BPF token is a desire to enable containerized >> > BPF applications to be used together with user namespaces. This is currently >> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced >> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF >> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read >> > arbitrary memory, and it's impossible to ensure that they only read memory of >> > processes belonging to any given namespace. This means that it's impossible to >> > have namespace-aware CAP_BPF capability, and as such another mechanism to >> > allow safe usage of BPF functionality is necessary. BPF token and delegation >> > of it to a trusted unprivileged applications is such mechanism. Kernel makes >> > no assumption about what "trusted" constitutes in any particular case, and >> > it's up to specific privileged applications and their surrounding >> > infrastructure to decide that. 
What kernel provides is a set of APIs to create >> > and tune BPF token, and pass it around to privileged BPF commands that are >> > creating new BPF objects like BPF programs, BPF maps, etc. >> >> So a colleague pointed out today that the Seccomp Notify functionality >> would be a way to achieve your stated goal of allowing unprivileged >> containers to (selectively) perform bpf() syscall operations. Christian >> Brauner has a pretty nice writeup of the functionality here: >> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development > > I'm amazed you read this. :) I found it quite an enjoyable read, actually :) > The seccomp notifier comes with a lot of caveats. I think it would be > impractical if not infeasible to handle bpf() delegation. Right, thank you for chiming in and explaining the context. I replied elsewhere in the thread on the content, so let's not fork the discussion any more than we have to... -Toke
On Wed, 05 Jul 2023 01:20:22 +0200 Toke Høiland-Jørgensen <toke@redhat.com> wrote: > Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > > > On Thu, Jun 29, 2023 at 4:15 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > >> > >> Andrii Nakryiko <andrii@kernel.org> writes: > >> > >> > This patch set introduces new BPF object, BPF token, which allows to delegate > >> > a subset of BPF functionality from privileged system-wide daemon (e.g., > >> > systemd or any other container manager) to a *trusted* unprivileged > >> > application. Trust is the key here. This functionality is not about allowing > >> > unconditional unprivileged BPF usage. Establishing trust, though, is > >> > completely up to the discretion of respective privileged application that > >> > would create a BPF token, as different production setups can and do achieve it > >> > through a combination of different means (signing, LSM, code reviews, etc), > >> > and it's undesirable and infeasible for kernel to enforce any particular way > >> > of validating trustworthiness of particular process. > >> > > >> > The main motivation for BPF token is a desire to enable containerized > >> > BPF applications to be used together with user namespaces. This is currently > >> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > >> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > >> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > >> > arbitrary memory, and it's impossible to ensure that they only read memory of > >> > processes belonging to any given namespace. This means that it's impossible to > >> > have namespace-aware CAP_BPF capability, and as such another mechanism to > >> > allow safe usage of BPF functionality is necessary. BPF token and delegation > >> > of it to a trusted unprivileged applications is such mechanism. 
Kernel makes > >> > no assumption about what "trusted" constitutes in any particular case, and > >> > it's up to specific privileged applications and their surrounding > >> > infrastructure to decide that. What kernel provides is a set of APIs to create > >> > and tune BPF token, and pass it around to privileged BPF commands that are > >> > creating new BPF objects like BPF programs, BPF maps, etc. > >> > >> So a colleague pointed out today that the Seccomp Notify functionality > >> would be a way to achieve your stated goal of allowing unprivileged > >> containers to (selectively) perform bpf() syscall operations. Christian > >> Brauner has a pretty nice writeup of the functionality here: > >> https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development > >> > >> In fact he even mentions allowing unprivileged access to bpf() as a > >> possible use case (in the second-to-last paragraph). > >> > >> AFAICT this would enable your use case without adding any new kernel > >> functionality or changing the BPF-using applications, while allowing the > >> privileged userspace daemon to make case-by-case decisions on each > >> operation instead of granting blanket capabilities (which is my main > >> objection to the token proposal, as we discussed on the last iteration > >> of the series). > > > > It's not "blanket" capabilities. You control types or maps and > > programs that could be created. And again, CAP_SYS_ADMIN guarded. > > Please, don't give CAP_SYS_ADMIN/root permissions to applications you > > can't be sure won't do something stupid and blame kernel API for it. > > Right, I didn't mean "blanket" in the sense of "permission to do > anything on the system"; I do get that you can restrict which subset of > functionality you grant. However, *within* that subset, it's a blanket > permission grant. 
I.e., you can't issue a token that grants a *specific* > application permission to load a *specific* BPF program - you can only > grant a general "load any program" permission that can be used by anyone > who possesses the token. > > I guess we could in principle extend the token mechanism to allow this, > but the kernel doesn't seem like the right place to implement such a > fine-grained policy engine... > > > After all, the root process can setuid() any file and make it run with > > elevated permissions, right? Doesn't get more "blanket" than that. > > Which is exactly why setuid binaries are not generally how we implement > security delegation these days. So I don't think designing a new > mechanism this way is a good idea. > > >> So I'm curious whether you considered this as an alternative to > >> BPF_TOKEN? And if so, what your reason was for rejecting it? > >> > > > > Yes, I'm aware, Christian has a follow up short blog post specifically > > for using this for proxying BPF from privileged process ([0]). > > > > So, in short, I think it's not a good generic solution. It's very > > fragile and high-maintenance. It's still proxying BPF UAPI (except > > application does preserve illusion of using BPF syscall, yes, that > > part is good) with all the implications: needing to replicate all of > > UAPI (fetching all those FDs from another process, following all the > > pointers from another process' memory, etc), and also writing back all > > the correct things (into another process' memory): log content, > > log_true_size (out param), any other output parameters. > > Right, OK, that bit does sound pretty tedious (although I'll note that > there are people who are trying to make all this generally more > palatable[0]). [0] https://seitan.rocks/ :) Some clickbaiting for Christian: the presentation we gave a couple of weeks ago, also linked from the project website, actually credits you (slide 29/30, of course). 
The code is still very much draft quality (we mostly focused on demos/feasibility so far, cleaning it up now), and we didn't prove (at least not yet) that handling complicated stuff such as bpf(2) is actually convenient, but that's at least in scope as a stretch goal. I'm not claiming it's doable, but we'd give it a try. What we have at the moment is a meagre set of eight syscall models, some blatantly incomplete. A couple of comments to specific points Christian mentioned: On Tue, 4 Jul 2023 11:38:38 +0200 Christian Brauner <brauner@kernel.org> wrote: > It's a pipe dream that you can transparently proxy system calls for > another process via seccomp for sufficiently complex system calls. We > did it for specific use-cases where we could sufficiently guarantee that > they could be safe. Right, so we're trying to pick it up from there. It's way too early to claim success, but I thought it would make sense to chime in anyway. > But to make this work it would involve way more invasive changes: > > * nesting/stacking of seccomp notifiers The need for stacked seccomp filters is obvious to me and that works more or less naturally. But why would you actually need to stack, or especially nest *notifiers* themselves? > * clean handling of pointer arguments in-kernel such that you can safely > continue system calls being sure that they haven't been modified. This > is currently only possible in scenarios where safety is guaranteed by > the kernel refusing nonsensical or unsafe arguments We're considering a couple of options. One is to never use SECCOMP_USER_NOTIF_FLAG_CONTINUE for system calls accepting pointers, or only allowing that as an explicit "unsafe" option. For a "safe" implementation, the supervisor (seitan) would in any case replay the system call, matching the context (namespaces, credentials) of the target process. 
If PID or TID (per se, not in terms of associated context/capabilities) of the caller matter for a specific system call, though, we simply can't support that. But that shouldn't actually be relevant for bpf(2). Strictly speaking, I think it's actually possible to "fix" this in the kernel by means of checking or copying memory that's addressable by a thread, but that might prove too invasive or end up in insurmountable layering violations. This mechanism would involve "control" paths rather than data paths, though, so the performance impact is not really worrying. Another option, which we outlined at this very convenient link: https://github.com/alicefr/community/blob/seitan/design-proposals/seitan/security-aspects-seitan.md#if-i-use-the-json-model-as-a-security-filter-can-another-thread-in-the-same-process-context-write-to-the-memory-area-pointed-to-by-system-call-arguments-while-the-calling-thread-is-blocked-and-defy-the-purpose-of-the-filter would be to make the supervisor perform a deep copy (system calls are anyway modeled in the seitan-cooker component) and then use good old ptrace(2) as needed. > * correct privilege handling > The seccomp notifier emulates system calls in userspace and thus has > to mimick the privilege context of the task it is emulating the system > call for in such a way that (i) it allows it to succeed by avoiding the > privilege limitations of why the given system call was supposed to be > proxied in the first place, (ii) it doesn't allow to circumvent other, > generic restrictions that would otherwise cause the system call to > fail. It's like saying e.g., "execute with most of the proxied task's > creds but let it have a few more privileges". That's frail as Linux > creds aren't really composable. That's why we have override_creds() > not "add_creds()" and "subtract_creds()" which would probably be > nicer. 
Right, at the moment we just run that as root, but we plan to take care of (ii) (albeit not solving it entirely, I guess), by at least applying a seccomp filter to the supervisor itself. As to the set of (composed?) capabilities, we don't have an answer yet. > Or it would have to be a generic first class kernel proxy which begs the > question why not change the subsystems itself to do this cleanly. Well, the fine-grained "policy" implementation we're trying to achieve looks to me like something that's a bit too complicated for the kernel, and really more appropriate for userspace.
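The pointer-argument concern and the deep-copy option discussed above can be illustrated with a small model. It is a sketch under stated assumptions, not seitan's implementation and not the seccomp API: the target's memory is modelled as a `bytearray`, the racing thread as a callback, and `check_then_continue` / `copy_then_replay` are hypothetical names for the two strategies.

```python
from typing import Callable, Optional

Check = Callable[[bytes], bool]       # policy decision on the argument bytes
Racer = Callable[[bytearray], None]   # another thread writing to shared memory
Execute = Callable[[bytes], bytes]    # stand-in for performing the system call


def check_then_continue(shared: bytearray, allowed: Check,
                        racer: Racer, execute: Execute) -> Optional[bytes]:
    """Unsafe pattern: validate in place, then let the call re-read the memory."""
    if not allowed(bytes(shared)):
        return None
    racer(shared)                   # memory can change while the caller is blocked
    return execute(bytes(shared))   # the call acts on whatever is there now


def copy_then_replay(shared: bytearray, allowed: Check,
                     racer: Racer, execute: Execute) -> Optional[bytes]:
    """Safer pattern: snapshot once, validate the snapshot, replay with the snapshot."""
    snapshot = bytes(shared)        # deep copy owned by the supervisor
    if not allowed(snapshot):
        return None
    racer(shared)                   # later writes no longer matter
    return execute(snapshot)


def overwrite(buf: bytearray) -> None:
    buf[:] = b"evil"


def run(strategy) -> Optional[bytes]:
    return strategy(bytearray(b"safe"), lambda b: b == b"safe", overwrite, lambda b: b)


print(run(check_then_continue))   # b'evil'
print(run(copy_then_replay))      # b'safe'
```

The first strategy corresponds to approving a call and letting it continue with pointers into memory the supervisor does not control; the second corresponds to the supervisor replaying the call itself from its own copy of the arguments.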
On Fri, Jun 30, 2023 at 7:06 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > On Thu, Jun 29, 2023 at 1:18 PM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > This patch set introduces new BPF object, BPF token, which allows to delegate > > a subset of BPF functionality from privileged system-wide daemon (e.g., > > systemd or any other container manager) to a *trusted* unprivileged > > application. Trust is the key here. This functionality is not about allowing > > unconditional unprivileged BPF usage. Establishing trust, though, is > > completely up to the discretion of respective privileged application that > > would create a BPF token, as different production setups can and do achieve it > > through a combination of different means (signing, LSM, code reviews, etc), > > and it's undesirable and infeasible for kernel to enforce any particular way > > of validating trustworthiness of particular process. > > > > The main motivation for BPF token is a desire to enable containerized > > BPF applications to be used together with user namespaces. This is currently > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > > arbitrary memory, and it's impossible to ensure that they only read memory of > > processes belonging to any given namespace. This means that it's impossible to > > have namespace-aware CAP_BPF capability, and as such another mechanism to > > allow safe usage of BPF functionality is necessary. BPF token and delegation > > of it to a trusted unprivileged applications is such mechanism. Kernel makes > > no assumption about what "trusted" constitutes in any particular case, and > > it's up to specific privileged applications and their surrounding > > infrastructure to decide that. 
What kernel provides is a set of APIs to create > > and tune BPF token, and pass it around to privileged BPF commands that are > > creating new BPF objects like BPF programs, BPF maps, etc. > > > > Previous attempt at addressing this very same problem ([0]) attempted to > > utilize authoritative LSM approach, but was conclusively rejected by upstream > > LSM maintainers. BPF token concept is not changing anything about LSM > > approach, but can be combined with LSM hooks for very fine-grained security > > policy. Some ideas about making BPF token more convenient to use with LSM (in > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF > > 2023 presentation ([1]). E.g., an ability to specify user-provided data > > (context), which in combination with BPF LSM would allow implementing a very > > dynamic and fine-granular custom security policies on top of BPF token. In the > > interest of minimizing API surface area discussions this is going to be > > added in follow up patches, as it's not essential to the fundamental concept > > of delegatable BPF token. > > > > It should be noted that BPF token is conceptually quite similar to the idea of > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest > > difference is the idea of using virtual anon_inode file to hold BPF token and > > allowing multiple independent instances of them, each with its own set of > > restrictions. BPF pinning solves the problem of exposing such BPF token > > through file system (BPF FS, in this case) for cases where transferring FDs > > over Unix domain sockets is not convenient. And also, crucially, BPF token > > approach is not using any special stateful task-scoped flags. Instead, bpf() > > syscall accepts token_fd parameters explicitly for each relevant BPF command. > > This addresses main concerns brought up during the /dev/bpf discussion, and > > fits better with overall BPF subsystem design. 
> > > > This patch set adds a basic minimum of functionality to make BPF token useful > > and to discuss API and functionality. Currently only low-level libbpf APIs > > support passing BPF token around, allowing to test kernel functionality, but > > for the most part is not sufficient for real-world applications, which > > typically use high-level libbpf APIs based on `struct bpf_object` type. This > > was done with the intent to limit the size of patch set and concentrate on > > mostly kernel-side changes. All the necessary plumbing for libbpf will be sent > > as a separate follow up patch set kernel support makes it upstream. > > > > Another part that should happen once kernel-side BPF token is established, is > > a set of conventions between applications (e.g., systemd), tools (e.g., > > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS > > at well-defined locations to allow applications take advantage of this in > > automatic fashion without explicit code changes on BPF application's side. > > But I'd like to postpone this discussion to after BPF token concept lands. > > > > Once important distinctions from v2 that should be noted is a chance in the > > semantics of a newly added BPF_TOKEN_CREATE command. Previously, > > BPF_TOKEN_CREATE would create BPF token kernel object and return its FD to > > user-space, allowing to (optionally) pin it in BPF FS using BPF_OBJ_PIN > > command. This v3 version changes this slightly: BPF_TOKEN_CREATE combines BPF > > token object creation *and* pinning in BPF FS. Such change ensures that BPF > > token is always associated with a specific instance of BPF FS and cannot > > "escape" it by application re-pinning it somewhere else using another > > BPF_OBJ_PIN call. Now, BPF token can only be pinned once during its creation, > > better containing it inside intended container (under assumption BPF FS is set > > up in such a way as to not be shared with other containers on the system). 
> > > > [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/ > > [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf > > [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/ > > > > v3->v3-resend: > > - I started integrating token_fd into bpf_object_open_opts and higher-level > > libbpf bpf_object APIs, but it started going a bit deeper into bpf_object > > implementation details and how libbpf performs feature detection and > > caching, so I decided to keep it separate from this patch set and not > > distract from the mostly kernel-side changes; > > v2->v3: > > - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow > > BPF_OBJ_PIN for BPF token; > > v1->v2: > > - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset; > > - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav). > > > > Andrii Nakryiko (14): > > bpf: introduce BPF token object > > libbpf: add bpf_token_create() API > > selftests/bpf: add BPF_TOKEN_CREATE test > > bpf: add BPF token support to BPF_MAP_CREATE command > > libbpf: add BPF token support to bpf_map_create() API > > selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command > > bpf: add BPF token support to BPF_BTF_LOAD command > > libbpf: add BPF token support to bpf_btf_load() API > > selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest > > bpf: add BPF token support to BPF_PROG_LOAD command > > bpf: take into account BPF token when fetching helper protos > > bpf: consistenly use BPF token throughout BPF verifier logic > > libbpf: add BPF token support to bpf_prog_load() API > > selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests > > > > drivers/media/rc/bpf-lirc.c | 2 +- > > include/linux/bpf.h | 79 ++++- > > include/linux/filter.h | 2 +- > > include/uapi/linux/bpf.h | 53 ++++ > > kernel/bpf/Makefile | 2 +- > > kernel/bpf/arraymap.c | 2 +- > > kernel/bpf/cgroup.c | 6 +- > > kernel/bpf/core.c | 3 +- > > 
kernel/bpf/helpers.c | 6 +- > > kernel/bpf/inode.c | 46 ++- > > kernel/bpf/syscall.c | 183 +++++++++--- > > kernel/bpf/token.c | 201 +++++++++++++ > > kernel/bpf/verifier.c | 13 +- > > kernel/trace/bpf_trace.c | 2 +- > > net/core/filter.c | 36 +-- > > net/ipv4/bpf_tcp_ca.c | 2 +- > > net/netfilter/nf_bpf_link.c | 2 +- > > tools/include/uapi/linux/bpf.h | 53 ++++ > > tools/lib/bpf/bpf.c | 35 ++- > > tools/lib/bpf/bpf.h | 45 ++- > > tools/lib/bpf/libbpf.map | 1 + > > .../selftests/bpf/prog_tests/libbpf_probes.c | 4 + > > .../selftests/bpf/prog_tests/libbpf_str.c | 6 + > > .../testing/selftests/bpf/prog_tests/token.c | 277 ++++++++++++++++++ > > 24 files changed, 957 insertions(+), 104 deletions(-) > > create mode 100644 kernel/bpf/token.c > > create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c > > > > -- > > 2.34.1 > > > > > > > Hi Andrii, > > Thanks for your proposal. > That seems to be a useful functionality, and yet I have some questions. I've answered them below. But I don't think either of them has any relation to BPF token and the problem I'm trying to solve. > > 1. Why can't we add security_bpf_probe_read_{kernel,user}? > If possible, we can use these LSM hooks to refuse the process to > read other tasks' information. E.g. if the other process is not within > the same cgroup or the same namespace, we just refuse the reading. I > think it is not hard to identify if the other process is within the > same cgroup or the same namespace. There are probably many reasons. First, performance-wise, an LSM hook for each bpf_probe_read_{kernel,user}() call would be prohibitively expensive. And just in general, one would need to be very careful with such LSM hooks, because bpf_probe_read_{kernel,user}() often happens from NMI context, and LSM policy would have to be written and validated very carefully with NMI context in mind.
But, more conceptually, for probe_read you get a random address and you know the process context you are running in (but you might be actually running in softirq and NMI, and that process context is irrelevant). How can you efficiently (or at all) tell if that random address "belongs" to cgroup or namespace? Just at conceptual level? > > 2. Why can't we extend bpf_cookie? > We're now using bpf_cookie to identify each user or each > application, and only the permitted cookies can create new probe > links. However we find the bpf_cookie is only supported by tracing, > perf_event and kprobe_multi, so we're planning to extend it to other > possible link types, then we can use LSM hooks to control all bpf > links. I think that the upstream kernel should also support > bpf_cookie for all bpf links. If possible, we will post it to the > upstream in the future. > After I have read your BPF token proposal, I just have some other > ideas. Why can't we just extend bpf_cookie to all other BPF objects? > For example, all progs and maps should also have the bpf_cookie. > I'm not exactly clear how you use BPF cookie, but it wasn't intended to provide any sort of security or validation policy. It's purely a user-provided u64 to help distinguish different attach points when the same BPF program is attached in multiple places (e.g., kprobe tracing many different kernel functions and needing to distinguish between them at runtime). I do agree BPF cookie is super useful and we should keep extending other types of BPF programs with BPF cookie support, of course. It's just completely orthogonal to BPF token discussion. > > -- > Regards > Yafang
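The role of the BPF cookie described in this reply, a user-provided u64 that lets one program tell its attach points apart, can be modelled as follows. This is a plain userspace illustration, not BPF code: `make_shared_handler` and the label table are hypothetical, and in a real BPF program the value would be read with a helper such as bpf_get_attach_cookie() rather than passed as an argument.

```python
from collections import Counter
from typing import Callable, Dict, Tuple


def make_shared_handler(labels: Dict[int, str]) -> Tuple[Callable[[int], None], Counter]:
    """One handler reused at many attach points; the cookie says which one fired."""
    hits: Counter = Counter()

    def handler(cookie: int) -> None:
        # The cookie carries no permissions; it is only used to label the event.
        hits[labels.get(cookie, "unknown")] += 1

    return handler, hits


# Hypothetical attach points, each given its own cookie at attach time.
labels = {1: "do_sys_open", 2: "vfs_read"}
handler, hits = make_shared_handler(labels)

for cookie in (1, 2, 2):
    handler(cookie)

print(dict(hits))
```

Nothing in this lookup restricts who may attach or with which cookie value, which matches the point made above that the cookie is a labelling aid rather than a security mechanism.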
On Tue, Jul 4, 2023 at 2:52 AM Christian Brauner <brauner@kernel.org> wrote: > > On Fri, Jun 30, 2023 at 01:15:47AM +0200, Toke Høiland-Jørgensen wrote: > > Andrii Nakryiko <andrii@kernel.org> writes: > > > > > This patch set introduces new BPF object, BPF token, which allows to delegate > > > a subset of BPF functionality from privileged system-wide daemon (e.g., > > > systemd or any other container manager) to a *trusted* unprivileged > > > application. Trust is the key here. This functionality is not about allowing > > > unconditional unprivileged BPF usage. Establishing trust, though, is > > > completely up to the discretion of respective privileged application that > > > would create a BPF token, as different production setups can and do achieve it > > > through a combination of different means (signing, LSM, code reviews, etc), > > > and it's undesirable and infeasible for kernel to enforce any particular way > > > of validating trustworthiness of particular process. > > > > > > The main motivation for BPF token is a desire to enable containerized > > > BPF applications to be used together with user namespaces. This is currently > > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > > > arbitrary memory, and it's impossible to ensure that they only read memory of > > > processes belonging to any given namespace. This means that it's impossible to > > > have namespace-aware CAP_BPF capability, and as such another mechanism to > > > allow safe usage of BPF functionality is necessary. BPF token and delegation > > > of it to a trusted unprivileged applications is such mechanism. 
Kernel makes > > > no assumption about what "trusted" constitutes in any particular case, and > > > it's up to specific privileged applications and their surrounding > > > infrastructure to decide that. What kernel provides is a set of APIs to create > > > and tune BPF token, and pass it around to privileged BPF commands that are > > > creating new BPF objects like BPF programs, BPF maps, etc. > > > > So a colleague pointed out today that the Seccomp Notify functionality > > would be a way to achieve your stated goal of allowing unprivileged > > containers to (selectively) perform bpf() syscall operations. Christian > > Brauner has a pretty nice writeup of the functionality here: > > https://people.kernel.org/brauner/the-seccomp-notifier-new-frontiers-in-unprivileged-container-development > > I'm amazed you read this. :) > The seccomp notifier comes with a lot of caveats. I think it would be > impractical if not infeasible to handle bpf() delegation. Thanks for confirming my hunch. And yeah, I read a bunch of posts from your blog. The one about the new mount APIs was especially useful given how little documentation I could find on them otherwise :) > > > > > In fact he even mentions allowing unprivileged access to bpf() as a > > possible use case (in the second-to-last paragraph). > > Yeah, I tried to work around a userspace regression with the > introduction of the cgroup v2 devices controller.
On Thu, Jul 6, 2023 at 4:37 AM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Fri, Jun 30, 2023 at 7:06 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > On Thu, Jun 29, 2023 at 1:18 PM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > This patch set introduces new BPF object, BPF token, which allows to delegate > > > a subset of BPF functionality from privileged system-wide daemon (e.g., > > > systemd or any other container manager) to a *trusted* unprivileged > > > application. Trust is the key here. This functionality is not about allowing > > > unconditional unprivileged BPF usage. Establishing trust, though, is > > > completely up to the discretion of respective privileged application that > > > would create a BPF token, as different production setups can and do achieve it > > > through a combination of different means (signing, LSM, code reviews, etc), > > > and it's undesirable and infeasible for kernel to enforce any particular way > > > of validating trustworthiness of particular process. > > > > > > The main motivation for BPF token is a desire to enable containerized > > > BPF applications to be used together with user namespaces. This is currently > > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > > > arbitrary memory, and it's impossible to ensure that they only read memory of > > > processes belonging to any given namespace. This means that it's impossible to > > > have namespace-aware CAP_BPF capability, and as such another mechanism to > > > allow safe usage of BPF functionality is necessary. BPF token and delegation > > > of it to a trusted unprivileged applications is such mechanism. 
Kernel makes > > > no assumption about what "trusted" constitutes in any particular case, and > > > it's up to specific privileged applications and their surrounding > > > infrastructure to decide that. What kernel provides is a set of APIs to create > > > and tune BPF token, and pass it around to privileged BPF commands that are > > > creating new BPF objects like BPF programs, BPF maps, etc. > > > > > > Previous attempt at addressing this very same problem ([0]) attempted to > > > utilize authoritative LSM approach, but was conclusively rejected by upstream > > > LSM maintainers. BPF token concept is not changing anything about LSM > > > approach, but can be combined with LSM hooks for very fine-grained security > > > policy. Some ideas about making BPF token more convenient to use with LSM (in > > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data > > > (context), which in combination with BPF LSM would allow implementing a very > > > dynamic and fine-granular custom security policies on top of BPF token. In the > > > interest of minimizing API surface area discussions this is going to be > > > added in follow up patches, as it's not essential to the fundamental concept > > > of delegatable BPF token. > > > > > > It should be noted that BPF token is conceptually quite similar to the idea of > > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest > > > difference is the idea of using virtual anon_inode file to hold BPF token and > > > allowing multiple independent instances of them, each with its own set of > > > restrictions. BPF pinning solves the problem of exposing such BPF token > > > through file system (BPF FS, in this case) for cases where transferring FDs > > > over Unix domain sockets is not convenient. And also, crucially, BPF token > > > approach is not using any special stateful task-scoped flags. 
Instead, bpf() > > > syscall accepts token_fd parameters explicitly for each relevant BPF command. > > > This addresses main concerns brought up during the /dev/bpf discussion, and > > > fits better with overall BPF subsystem design. > > > > > > This patch set adds a basic minimum of functionality to make BPF token useful > > > and to discuss API and functionality. Currently only low-level libbpf APIs > > > support passing BPF token around, allowing to test kernel functionality, but > > > for the most part is not sufficient for real-world applications, which > > > typically use high-level libbpf APIs based on `struct bpf_object` type. This > > > was done with the intent to limit the size of patch set and concentrate on > > > mostly kernel-side changes. All the necessary plumbing for libbpf will be sent > > > as a separate follow up patch set kernel support makes it upstream. > > > > > > Another part that should happen once kernel-side BPF token is established, is > > > a set of conventions between applications (e.g., systemd), tools (e.g., > > > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS > > > at well-defined locations to allow applications take advantage of this in > > > automatic fashion without explicit code changes on BPF application's side. > > > But I'd like to postpone this discussion to after BPF token concept lands. > > > > > > Once important distinctions from v2 that should be noted is a chance in the > > > semantics of a newly added BPF_TOKEN_CREATE command. Previously, > > > BPF_TOKEN_CREATE would create BPF token kernel object and return its FD to > > > user-space, allowing to (optionally) pin it in BPF FS using BPF_OBJ_PIN > > > command. This v3 version changes this slightly: BPF_TOKEN_CREATE combines BPF > > > token object creation *and* pinning in BPF FS. 
Such change ensures that BPF > > > token is always associated with a specific instance of BPF FS and cannot > > > "escape" it by application re-pinning it somewhere else using another > > > BPF_OBJ_PIN call. Now, BPF token can only be pinned once during its creation, > > > better containing it inside intended container (under assumption BPF FS is set > > > up in such a way as to not be shared with other containers on the system). > > > > > > [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/ > > > [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf > > > [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/ > > > > > > v3->v3-resend: > > > - I started integrating token_fd into bpf_object_open_opts and higher-level > > > libbpf bpf_object APIs, but it started going a bit deeper into bpf_object > > > implementation details and how libbpf performs feature detection and > > > caching, so I decided to keep it separate from this patch set and not > > > distract from the mostly kernel-side changes; > > > v2->v3: > > > - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow > > > BPF_OBJ_PIN for BPF token; > > > v1->v2: > > > - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset; > > > - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav). 
> > > > > > Andrii Nakryiko (14): > > > bpf: introduce BPF token object > > > libbpf: add bpf_token_create() API > > > selftests/bpf: add BPF_TOKEN_CREATE test > > > bpf: add BPF token support to BPF_MAP_CREATE command > > > libbpf: add BPF token support to bpf_map_create() API > > > selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command > > > bpf: add BPF token support to BPF_BTF_LOAD command > > > libbpf: add BPF token support to bpf_btf_load() API > > > selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest > > > bpf: add BPF token support to BPF_PROG_LOAD command > > > bpf: take into account BPF token when fetching helper protos > > > bpf: consistenly use BPF token throughout BPF verifier logic > > > libbpf: add BPF token support to bpf_prog_load() API > > > selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests > > > > > > drivers/media/rc/bpf-lirc.c | 2 +- > > > include/linux/bpf.h | 79 ++++- > > > include/linux/filter.h | 2 +- > > > include/uapi/linux/bpf.h | 53 ++++ > > > kernel/bpf/Makefile | 2 +- > > > kernel/bpf/arraymap.c | 2 +- > > > kernel/bpf/cgroup.c | 6 +- > > > kernel/bpf/core.c | 3 +- > > > kernel/bpf/helpers.c | 6 +- > > > kernel/bpf/inode.c | 46 ++- > > > kernel/bpf/syscall.c | 183 +++++++++--- > > > kernel/bpf/token.c | 201 +++++++++++++ > > > kernel/bpf/verifier.c | 13 +- > > > kernel/trace/bpf_trace.c | 2 +- > > > net/core/filter.c | 36 +-- > > > net/ipv4/bpf_tcp_ca.c | 2 +- > > > net/netfilter/nf_bpf_link.c | 2 +- > > > tools/include/uapi/linux/bpf.h | 53 ++++ > > > tools/lib/bpf/bpf.c | 35 ++- > > > tools/lib/bpf/bpf.h | 45 ++- > > > tools/lib/bpf/libbpf.map | 1 + > > > .../selftests/bpf/prog_tests/libbpf_probes.c | 4 + > > > .../selftests/bpf/prog_tests/libbpf_str.c | 6 + > > > .../testing/selftests/bpf/prog_tests/token.c | 277 ++++++++++++++++++ > > > 24 files changed, 957 insertions(+), 104 deletions(-) > > > create mode 100644 kernel/bpf/token.c > > > create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c 
> > > > > > -- > > > 2.34.1 > > > > > > > > > > > > Hi Andrii, > > > > Thanks for your proposal. > > That seems to be a useful functionality, and yet I have some questions. > > I've answered them below. But I don't think either of them have any > relation to BPF token and the problem I'm trying to solve. > > > > > 1. Why can't we add security_bpf_probe_read_{kernel,user}? > > If possible, we can use these LSM hooks to refuse the process to > > read other tasks' information. E.g. if the other process is not within > > the same cgroup or the same namespace, we just refuse the reading. I > > think it is not hard to identify if the other process is within the > > same cgroup or the same namespace. > > There are probably many reasons. First, performance-wide, LSM hook for > each bpf_probe_read_{kernel,user}() call will be prohibitive. And just > in general, one would need to be very careful with such LSM hooks, > because bpf_probe_read_{kernel,user}() often happens from NMI context, > and LSM policy would have to be written and validated very carefully > with NMI context in mind. > > But, more conceptually, for probe_read you get a random address and > you know the process context you are running in (but you might be > actually running in softirq and NMI, and that process context is > irrelevant). How can you efficiently (or at all) tell if that random > address "belongs" to cgroup or namespace? Just at conceptual level? > > > > > 2. Why can't we extend bpf_cookie? > > We're now using bpf_cookie to identify each user or each > > application, and only the permitted cookies can create new probe > > links. However we find the bpf_cookie is only supported by tracing, > > perf_event and kprobe_multi, so we're planning to extend it to other > > possible link types, then we can use LSM hooks to control all bpf > > links. I think that the upstream kernel should also support > > bpf_cookie for all bpf links. If possible, we will post it to the > > upstream in the future. 
> > After I have read your BPF token proposal, I just have some other > > ideas. Why can't we just extend bpf_cookie to all other BPF objects? > > For example, all progs and maps should also have the bpf_cookie. > > > > I'm not exactly clear how you use BPF cookie, but it wasn't intended > to provide any sort of security or validation policy. It's purely a > user-provided u64 to help distinguish different attach points when the > same BPF program is attached in multiple places (e.g., kprobe tracing > many different kernel functions and needing to distinguish between > them at runtime). In our container environment, we enable CAP_BPF, CAP_PERFMON and CAP_NET_ADMIN for the containers which want to run BPF programs inside. However, we don't want them to run whatever BPF programs they want. We only allow them to run the BPF programs we have permitted for each of them. So we are using LSM to audit the BPF behavior such as prog load, map creation and link attach. We define different BPF policies for different containers. In order to identify different containers efficiently, we assign different bpf_cookies for different containers. bpf_cookie is a u64, and that's enough for our use cases. We didn't use cgroup id to identify different containers because cgroup id is a local value in a server, while bpf_cookie is a global value, which makes deployment easy. For your use cases, maybe we could enable CAP_BPF (+CAP_PERFMON, +CAP_NET_ADMIN) for all users, and then we assign different bpf_cookies for different users, so we can use LSM to allow the users who have the permitted cookies to run BPF programs? > > I do agree BPF cookie is super useful and we should keep extending > other types of BPF programs with BPF cookie support, of course. It's > just completely orthogonal to BPF token discussion. >
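The per-container policy scheme described in this message can be modeled roughly as a lookup from cookie to a set of permitted operations. The sketch below is a plain userspace illustration only, assuming a simple table-driven policy; `enum demo_op`, `struct container_policy`, the cookie values and `op_permitted()` are all hypothetical and do not correspond to any kernel, LSM or libbpf interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation classes a policy may allow. */
enum demo_op {
    OP_PROG_LOAD   = 1u << 0,
    OP_MAP_CREATE  = 1u << 1,
    OP_LINK_ATTACH = 1u << 2,
};

/* One policy entry: a globally assigned cookie and what it may do. */
struct container_policy {
    uint64_t cookie;
    unsigned int allowed_ops;
};

/* Example table; the cookie values are made up for illustration. */
static const struct container_policy policies[] = {
    { 0x1001, OP_PROG_LOAD | OP_MAP_CREATE | OP_LINK_ATTACH },
    { 0x1002, OP_MAP_CREATE },
};

/* Return true only if the cookie is known and the operation is allowed. */
static bool op_permitted(uint64_t cookie, enum demo_op op)
{
    for (size_t i = 0; i < sizeof(policies) / sizeof(policies[0]); i++) {
        if (policies[i].cookie == cookie)
            return (policies[i].allowed_ops & op) != 0;
    }
    return false; /* unknown cookie: deny by default */
}
```

A default-deny lookup keeps an unrecognized cookie from gaining any access, which matches the stated goal of running only explicitly permitted BPF programs.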
On Wed, Jul 5, 2023 at 6:27 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > On Thu, Jul 6, 2023 at 4:37 AM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > On Fri, Jun 30, 2023 at 7:06 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > On Thu, Jun 29, 2023 at 1:18 PM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > > > This patch set introduces new BPF object, BPF token, which allows to delegate > > > > a subset of BPF functionality from privileged system-wide daemon (e.g., > > > > systemd or any other container manager) to a *trusted* unprivileged > > > > application. Trust is the key here. This functionality is not about allowing > > > > unconditional unprivileged BPF usage. Establishing trust, though, is > > > > completely up to the discretion of respective privileged application that > > > > would create a BPF token, as different production setups can and do achieve it > > > > through a combination of different means (signing, LSM, code reviews, etc), > > > > and it's undesirable and infeasible for kernel to enforce any particular way > > > > of validating trustworthiness of particular process. > > > > > > > > The main motivation for BPF token is a desire to enable containerized > > > > BPF applications to be used together with user namespaces. This is currently > > > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > > > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > > > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > > > > arbitrary memory, and it's impossible to ensure that they only read memory of > > > > processes belonging to any given namespace. This means that it's impossible to > > > > have namespace-aware CAP_BPF capability, and as such another mechanism to > > > > allow safe usage of BPF functionality is necessary. BPF token and delegation > > > > of it to a trusted unprivileged applications is such mechanism. 
Kernel makes > > > > no assumption about what "trusted" constitutes in any particular case, and > > > > it's up to specific privileged applications and their surrounding > > > > infrastructure to decide that. What kernel provides is a set of APIs to create > > > > and tune BPF token, and pass it around to privileged BPF commands that are > > > > creating new BPF objects like BPF programs, BPF maps, etc. > > > > > > > > Previous attempt at addressing this very same problem ([0]) attempted to > > > > utilize authoritative LSM approach, but was conclusively rejected by upstream > > > > LSM maintainers. BPF token concept is not changing anything about LSM > > > > approach, but can be combined with LSM hooks for very fine-grained security > > > > policy. Some ideas about making BPF token more convenient to use with LSM (in > > > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF > > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data > > > > (context), which in combination with BPF LSM would allow implementing a very > > > > dynamic and fine-granular custom security policies on top of BPF token. In the > > > > interest of minimizing API surface area discussions this is going to be > > > > added in follow up patches, as it's not essential to the fundamental concept > > > > of delegatable BPF token. > > > > > > > > It should be noted that BPF token is conceptually quite similar to the idea of > > > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest > > > > difference is the idea of using virtual anon_inode file to hold BPF token and > > > > allowing multiple independent instances of them, each with its own set of > > > > restrictions. BPF pinning solves the problem of exposing such BPF token > > > > through file system (BPF FS, in this case) for cases where transferring FDs > > > > over Unix domain sockets is not convenient. 
And also, crucially, BPF token > > > > approach is not using any special stateful task-scoped flags. Instead, bpf() > > > > syscall accepts token_fd parameters explicitly for each relevant BPF command. > > > > This addresses main concerns brought up during the /dev/bpf discussion, and > > > > fits better with overall BPF subsystem design. > > > > > > > > This patch set adds a basic minimum of functionality to make BPF token useful > > > > and to discuss API and functionality. Currently only low-level libbpf APIs > > > > support passing BPF token around, allowing to test kernel functionality, but > > > > for the most part is not sufficient for real-world applications, which > > > > typically use high-level libbpf APIs based on `struct bpf_object` type. This > > > > was done with the intent to limit the size of patch set and concentrate on > > > > mostly kernel-side changes. All the necessary plumbing for libbpf will be sent > > > > as a separate follow up patch set kernel support makes it upstream. > > > > > > > > Another part that should happen once kernel-side BPF token is established, is > > > > a set of conventions between applications (e.g., systemd), tools (e.g., > > > > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS > > > > at well-defined locations to allow applications take advantage of this in > > > > automatic fashion without explicit code changes on BPF application's side. > > > > But I'd like to postpone this discussion to after BPF token concept lands. > > > > > > > > Once important distinctions from v2 that should be noted is a chance in the > > > > semantics of a newly added BPF_TOKEN_CREATE command. Previously, > > > > BPF_TOKEN_CREATE would create BPF token kernel object and return its FD to > > > > user-space, allowing to (optionally) pin it in BPF FS using BPF_OBJ_PIN > > > > command. This v3 version changes this slightly: BPF_TOKEN_CREATE combines BPF > > > > token object creation *and* pinning in BPF FS. 
Such change ensures that BPF > > > > token is always associated with a specific instance of BPF FS and cannot > > > > "escape" it by application re-pinning it somewhere else using another > > > > BPF_OBJ_PIN call. Now, BPF token can only be pinned once during its creation, > > > > better containing it inside intended container (under assumption BPF FS is set > > > > up in such a way as to not be shared with other containers on the system). > > > > > > > > [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/ > > > > [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf > > > > [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/ > > > > > > > > v3->v3-resend: > > > > - I started integrating token_fd into bpf_object_open_opts and higher-level > > > > libbpf bpf_object APIs, but it started going a bit deeper into bpf_object > > > > implementation details and how libbpf performs feature detection and > > > > caching, so I decided to keep it separate from this patch set and not > > > > distract from the mostly kernel-side changes; > > > > v2->v3: > > > > - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow > > > > BPF_OBJ_PIN for BPF token; > > > > v1->v2: > > > > - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset; > > > > - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav). 
> > > > > > > > Andrii Nakryiko (14): > > > > bpf: introduce BPF token object > > > > libbpf: add bpf_token_create() API > > > > selftests/bpf: add BPF_TOKEN_CREATE test > > > > bpf: add BPF token support to BPF_MAP_CREATE command > > > > libbpf: add BPF token support to bpf_map_create() API > > > > selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command > > > > bpf: add BPF token support to BPF_BTF_LOAD command > > > > libbpf: add BPF token support to bpf_btf_load() API > > > > selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest > > > > bpf: add BPF token support to BPF_PROG_LOAD command > > > > bpf: take into account BPF token when fetching helper protos > > > > bpf: consistenly use BPF token throughout BPF verifier logic > > > > libbpf: add BPF token support to bpf_prog_load() API > > > > selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests > > > > > > > > drivers/media/rc/bpf-lirc.c | 2 +- > > > > include/linux/bpf.h | 79 ++++- > > > > include/linux/filter.h | 2 +- > > > > include/uapi/linux/bpf.h | 53 ++++ > > > > kernel/bpf/Makefile | 2 +- > > > > kernel/bpf/arraymap.c | 2 +- > > > > kernel/bpf/cgroup.c | 6 +- > > > > kernel/bpf/core.c | 3 +- > > > > kernel/bpf/helpers.c | 6 +- > > > > kernel/bpf/inode.c | 46 ++- > > > > kernel/bpf/syscall.c | 183 +++++++++--- > > > > kernel/bpf/token.c | 201 +++++++++++++ > > > > kernel/bpf/verifier.c | 13 +- > > > > kernel/trace/bpf_trace.c | 2 +- > > > > net/core/filter.c | 36 +-- > > > > net/ipv4/bpf_tcp_ca.c | 2 +- > > > > net/netfilter/nf_bpf_link.c | 2 +- > > > > tools/include/uapi/linux/bpf.h | 53 ++++ > > > > tools/lib/bpf/bpf.c | 35 ++- > > > > tools/lib/bpf/bpf.h | 45 ++- > > > > tools/lib/bpf/libbpf.map | 1 + > > > > .../selftests/bpf/prog_tests/libbpf_probes.c | 4 + > > > > .../selftests/bpf/prog_tests/libbpf_str.c | 6 + > > > > .../testing/selftests/bpf/prog_tests/token.c | 277 ++++++++++++++++++ > > > > 24 files changed, 957 insertions(+), 104 deletions(-) > > > > create mode 100644 
kernel/bpf/token.c > > > > create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c > > > > > > > > -- > > > > 2.34.1 > > > > > > > > > > > > > > > > > Hi Andrii, > > > > > > Thanks for your proposal. > > > That seems to be a useful functionality, and yet I have some questions. > > > > I've answered them below. But I don't think either of them have any > > relation to BPF token and the problem I'm trying to solve. > > > > > > > > 1. Why can't we add security_bpf_probe_read_{kernel,user}? > > > If possible, we can use these LSM hooks to refuse the process to > > > read other tasks' information. E.g. if the other process is not within > > > the same cgroup or the same namespace, we just refuse the reading. I > > > think it is not hard to identify if the other process is within the > > > same cgroup or the same namespace. > > > > There are probably many reasons. First, performance-wide, LSM hook for > > each bpf_probe_read_{kernel,user}() call will be prohibitive. And just > > in general, one would need to be very careful with such LSM hooks, > > because bpf_probe_read_{kernel,user}() often happens from NMI context, > > and LSM policy would have to be written and validated very carefully > > with NMI context in mind. > > > > But, more conceptually, for probe_read you get a random address and > > you know the process context you are running in (but you might be > > actually running in softirq and NMI, and that process context is > > irrelevant). How can you efficiently (or at all) tell if that random > > address "belongs" to cgroup or namespace? Just at conceptual level? > > > > > > > > 2. Why can't we extend bpf_cookie? > > > We're now using bpf_cookie to identify each user or each > > > application, and only the permitted cookies can create new probe > > > links. 
However we find the bpf_cookie is only supported by tracing, > > > perf_event and kprobe_multi, so we're planning to extend it to other > > > possible link types, then we can use LSM hooks to control all bpf > > > links. I think that the upstream kernel should also support > > > bpf_cookie for all bpf links. If possible, we will post it to the > > > upstream in the future. > > > After I have read your BPF token proposal, I just have some other > > > ideas. Why can't we just extend bpf_cookie to all other BPF objects? > > > For example, all progs and maps should also have the bpf_cookie. > > > > > > > I'm not exactly clear how you use BPF cookie, but it wasn't intended > > to provide any sort of security or validation policy. It's purely a > > user-provided u64 to help distinguish different attach points when the > > same BPF program is attached in multiple places (e.g., kprobe tracing > > many different kernel functions and needing to distinguish between > > them at runtime). > > In our container environment, we enable the CAP_BPF, CAP_PERFMON and > CAP_NET_ADMIN for the containers which want to run BPF programs > inside. However we don't want them to run whatever BPF programs they > want. We only allow them to run the BPF programs we have permitted for > each of them. So we are using LSM to audit the BPF behavior such as > prog load, map creation and link attach. We define different BPF > policies for different containers. In order to identify different > containers efficiently, we assign different bpf_cookies for different > containers. bpf_cookie is a u64, that's enough for our use cases. I can see how you can use BPF cookies for this, but it's certainly not an intended use case :) BPF cookie is most useful on the BPF side of things. But what you are describing is meant to be doable with BPF token. It's not in the first patch set, but I intended to allow the user to specify an extra "user context" blob of bytes which would be stored with the BPF token.
And this data should be accessible from BPF LSM programs to make extra custom policy decisions. But we need to agree on the initial BPF token stuff first, and then build out all the rest. > We didn't use cgroup id to identify different containers because > cgroup id is a local value in a server, while bpf_cookie is a global > value, that would be easy for deployment. > For your use cases, maybe we could enable CAP_BPF (+CAP_PERFMON, > +CAP_NET_ADMIN) for all users, and then we assign different > bpf_cookies for different users, so we can use LSM to allow the users > who have the permitted cookies to run BPF programs? > > > > > I do agree BPF cookie is super useful and we should keep extending > > other types of BPF programs with BPF cookie support, of course. It's > > just completely orthogonal to BPF token discussion. > > > > -- > Regards > Yafang
On Fri, Jul 7, 2023 at 4:34 AM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Wed, Jul 5, 2023 at 6:27 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > On Thu, Jul 6, 2023 at 4:37 AM Andrii Nakryiko > > <andrii.nakryiko@gmail.com> wrote: > > > > > > On Fri, Jun 30, 2023 at 7:06 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > On Thu, Jun 29, 2023 at 1:18 PM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > > > > > This patch set introduces new BPF object, BPF token, which allows to delegate > > > > > a subset of BPF functionality from privileged system-wide daemon (e.g., > > > > > systemd or any other container manager) to a *trusted* unprivileged > > > > > application. Trust is the key here. This functionality is not about allowing > > > > > unconditional unprivileged BPF usage. Establishing trust, though, is > > > > > completely up to the discretion of respective privileged application that > > > > > would create a BPF token, as different production setups can and do achieve it > > > > > through a combination of different means (signing, LSM, code reviews, etc), > > > > > and it's undesirable and infeasible for kernel to enforce any particular way > > > > > of validating trustworthiness of particular process. > > > > > > > > > > The main motivation for BPF token is a desire to enable containerized > > > > > BPF applications to be used together with user namespaces. This is currently > > > > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > > > > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > > > > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > > > > > arbitrary memory, and it's impossible to ensure that they only read memory of > > > > > processes belonging to any given namespace. 
This means that it's impossible to > > > > > have namespace-aware CAP_BPF capability, and as such another mechanism to > > > > > allow safe usage of BPF functionality is necessary. BPF token and delegation > > > > > of it to a trusted unprivileged applications is such mechanism. Kernel makes > > > > > no assumption about what "trusted" constitutes in any particular case, and > > > > > it's up to specific privileged applications and their surrounding > > > > > infrastructure to decide that. What kernel provides is a set of APIs to create > > > > > and tune BPF token, and pass it around to privileged BPF commands that are > > > > > creating new BPF objects like BPF programs, BPF maps, etc. > > > > > > > > > > Previous attempt at addressing this very same problem ([0]) attempted to > > > > > utilize authoritative LSM approach, but was conclusively rejected by upstream > > > > > LSM maintainers. BPF token concept is not changing anything about LSM > > > > > approach, but can be combined with LSM hooks for very fine-grained security > > > > > policy. Some ideas about making BPF token more convenient to use with LSM (in > > > > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF > > > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data > > > > > (context), which in combination with BPF LSM would allow implementing a very > > > > > dynamic and fine-granular custom security policies on top of BPF token. In the > > > > > interest of minimizing API surface area discussions this is going to be > > > > > added in follow up patches, as it's not essential to the fundamental concept > > > > > of delegatable BPF token. > > > > > > > > > > It should be noted that BPF token is conceptually quite similar to the idea of > > > > > /dev/bpf device file, proposed by Song a while ago ([2]). 
The biggest > > > > > difference is the idea of using virtual anon_inode file to hold BPF token and > > > > > allowing multiple independent instances of them, each with its own set of > > > > > restrictions. BPF pinning solves the problem of exposing such BPF token > > > > > through file system (BPF FS, in this case) for cases where transferring FDs > > > > > over Unix domain sockets is not convenient. And also, crucially, BPF token > > > > > approach is not using any special stateful task-scoped flags. Instead, bpf() > > > > > syscall accepts token_fd parameters explicitly for each relevant BPF command. > > > > > This addresses main concerns brought up during the /dev/bpf discussion, and > > > > > fits better with overall BPF subsystem design. > > > > > > > > > > This patch set adds a basic minimum of functionality to make BPF token useful > > > > > and to discuss API and functionality. Currently only low-level libbpf APIs > > > > > support passing BPF token around, allowing to test kernel functionality, but > > > > > for the most part is not sufficient for real-world applications, which > > > > > typically use high-level libbpf APIs based on `struct bpf_object` type. This > > > > > was done with the intent to limit the size of patch set and concentrate on > > > > > mostly kernel-side changes. All the necessary plumbing for libbpf will be sent > > > > > as a separate follow up patch set kernel support makes it upstream. > > > > > > > > > > Another part that should happen once kernel-side BPF token is established, is > > > > > a set of conventions between applications (e.g., systemd), tools (e.g., > > > > > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS > > > > > at well-defined locations to allow applications take advantage of this in > > > > > automatic fashion without explicit code changes on BPF application's side. > > > > > But I'd like to postpone this discussion to after BPF token concept lands. 
> > > > >
> > > > > One important distinction from v2 that should be noted is a change in the
> > > > > semantics of a newly added BPF_TOKEN_CREATE command. Previously,
> > > > > BPF_TOKEN_CREATE would create BPF token kernel object and return its FD to
> > > > > user-space, allowing to (optionally) pin it in BPF FS using BPF_OBJ_PIN
> > > > > command. This v3 version changes this slightly: BPF_TOKEN_CREATE combines BPF
> > > > > token object creation *and* pinning in BPF FS. Such change ensures that BPF
> > > > > token is always associated with a specific instance of BPF FS and cannot
> > > > > "escape" it by application re-pinning it somewhere else using another
> > > > > BPF_OBJ_PIN call. Now, BPF token can only be pinned once during its creation,
> > > > > better containing it inside intended container (under assumption BPF FS is set
> > > > > up in such a way as to not be shared with other containers on the system).
> > > > >
> > > > >   [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
> > > > >   [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
> > > > >   [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
> > > > >
> > > > > v3->v3-resend:
> > > > >   - I started integrating token_fd into bpf_object_open_opts and higher-level
> > > > >     libbpf bpf_object APIs, but it started going a bit deeper into bpf_object
> > > > >     implementation details and how libbpf performs feature detection and
> > > > >     caching, so I decided to keep it separate from this patch set and not
> > > > >     distract from the mostly kernel-side changes;
> > > > > v2->v3:
> > > > >   - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow
> > > > >     BPF_OBJ_PIN for BPF token;
> > > > > v1->v2:
> > > > >   - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
> > > > >   - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).
> > > > >
> > > > > Andrii Nakryiko (14):
> > > > >   bpf: introduce BPF token object
> > > > >   libbpf: add bpf_token_create() API
> > > > >   selftests/bpf: add BPF_TOKEN_CREATE test
> > > > >   bpf: add BPF token support to BPF_MAP_CREATE command
> > > > >   libbpf: add BPF token support to bpf_map_create() API
> > > > >   selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command
> > > > >   bpf: add BPF token support to BPF_BTF_LOAD command
> > > > >   libbpf: add BPF token support to bpf_btf_load() API
> > > > >   selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest
> > > > >   bpf: add BPF token support to BPF_PROG_LOAD command
> > > > >   bpf: take into account BPF token when fetching helper protos
> > > > >   bpf: consistenly use BPF token throughout BPF verifier logic
> > > > >   libbpf: add BPF token support to bpf_prog_load() API
> > > > >   selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests
> > > > >
> > > > >  drivers/media/rc/bpf-lirc.c                  |   2 +-
> > > > >  include/linux/bpf.h                          |  79 ++++-
> > > > >  include/linux/filter.h                       |   2 +-
> > > > >  include/uapi/linux/bpf.h                     |  53 ++++
> > > > >  kernel/bpf/Makefile                          |   2 +-
> > > > >  kernel/bpf/arraymap.c                        |   2 +-
> > > > >  kernel/bpf/cgroup.c                          |   6 +-
> > > > >  kernel/bpf/core.c                            |   3 +-
> > > > >  kernel/bpf/helpers.c                         |   6 +-
> > > > >  kernel/bpf/inode.c                           |  46 ++-
> > > > >  kernel/bpf/syscall.c                         | 183 +++++++++---
> > > > >  kernel/bpf/token.c                           | 201 +++++++++++++
> > > > >  kernel/bpf/verifier.c                        |  13 +-
> > > > >  kernel/trace/bpf_trace.c                     |   2 +-
> > > > >  net/core/filter.c                            |  36 +--
> > > > >  net/ipv4/bpf_tcp_ca.c                        |   2 +-
> > > > >  net/netfilter/nf_bpf_link.c                  |   2 +-
> > > > >  tools/include/uapi/linux/bpf.h               |  53 ++++
> > > > >  tools/lib/bpf/bpf.c                          |  35 ++-
> > > > >  tools/lib/bpf/bpf.h                          |  45 ++-
> > > > >  tools/lib/bpf/libbpf.map                     |   1 +
> > > > >  .../selftests/bpf/prog_tests/libbpf_probes.c |   4 +
> > > > >  .../selftests/bpf/prog_tests/libbpf_str.c    |   6 +
> > > > >  .../testing/selftests/bpf/prog_tests/token.c | 277 ++++++++++++++++++
> > > > >  24 files changed, 957 insertions(+), 104 deletions(-)
> > > > >  create mode 100644 kernel/bpf/token.c
> > > > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c
> > > > >
> > > > > --
> > > > > 2.34.1
> > > > >
> > > >
> > > > Hi Andrii,
> > > >
> > > > Thanks for your proposal.
> > > > That seems to be a useful functionality, and yet I have some questions.
> > >
> > > I've answered them below. But I don't think either of them have any
> > > relation to BPF token and the problem I'm trying to solve.
> > >
> > > >
> > > > 1. Why can't we add security_bpf_probe_read_{kernel,user}?
> > > > If possible, we can use these LSM hooks to refuse the process to
> > > > read other tasks' information. E.g. if the other process is not within
> > > > the same cgroup or the same namespace, we just refuse the reading. I
> > > > think it is not hard to identify if the other process is within the
> > > > same cgroup or the same namespace.
> > >
> > > There are probably many reasons. First, performance-wise, LSM hook for
> > > each bpf_probe_read_{kernel,user}() call will be prohibitive. And just
> > > in general, one would need to be very careful with such LSM hooks,
> > > because bpf_probe_read_{kernel,user}() often happens from NMI context,
> > > and LSM policy would have to be written and validated very carefully
> > > with NMI context in mind.
> > >
> > > But, more conceptually, for probe_read you get a random address and
> > > you know the process context you are running in (but you might be
> > > actually running in softirq and NMI, and that process context is
> > > irrelevant). How can you efficiently (or at all) tell if that random
> > > address "belongs" to cgroup or namespace? Just at conceptual level?
> > >
> > > >
> > > > 2. Why can't we extend bpf_cookie?
> > > > We're now using bpf_cookie to identify each user or each
> > > > application, and only the permitted cookies can create new probe
> > > > links.
> > > > However we find the bpf_cookie is only supported by tracing,
> > > > perf_event and kprobe_multi, so we're planning to extend it to other
> > > > possible link types, then we can use LSM hooks to control all bpf
> > > > links. I think that the upstream kernel should also support
> > > > bpf_cookie for all bpf links. If possible, we will post it to the
> > > > upstream in the future.
> > > > After I have read your BPF token proposal, I just have some other
> > > > ideas. Why can't we just extend bpf_cookie to all other BPF objects?
> > > > For example, all progs and maps should also have the bpf_cookie.
> > > >
> > >
> > > I'm not exactly clear how you use BPF cookie, but it wasn't intended
> > > to provide any sort of security or validation policy. It's purely a
> > > user-provided u64 to help distinguish different attach points when the
> > > same BPF program is attached in multiple places (e.g., kprobe tracing
> > > many different kernel functions and needing to distinguish between
> > > them at runtime).
> >
> > In our container environment, we enable the CAP_BPF, CAP_PERFMON and
> > CAP_NET_ADMIN for the containers which want to run BPF programs
> > inside. However we don't want them to run whatever BPF programs they
> > want. We only allow them to run the BPF programs we have permitted for
> > each of them. So we are using LSM to audit the BPF behavior such as
> > prog load, map creation and link attach. We define different BPF
> > policies for different containers. In order to identify different
> > containers efficiently, we assign different bpf_cookies for different
> > containers. bpf_cookie is a u64, that's enough for our use cases.
>
> I can see how you can use BPF cookies for this, but it's certainly not
> an intended use case :) BPF cookie is most useful on BPF side of
> things.

The utilization of the bpf_cookie appid in our use case has proven to be
valuable, thus we continue to rely on its functionality :)

>
> But what you are describing is meant to be doable with BPF token. It's
> not in the first patch set, but I intended to allow user to specify an
> extra "user context" blob of bytes which would be stored with BPF
> token. And this data should be accessible from BPF LSM programs to
> make extra custom policy decisions. But we need to agree on initial
> BPF token stuff first, and then build out all the rest.

Sounds good. Introducing support for user context within the BPF token
would enhance its utility and provide even more valuable functionality.
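
For illustration only, here is a minimal user-space sketch of the kind of
per-container policy lookup discussed in this thread: each container is
identified by a u64 (a bpf_cookie in the setup described above, or possibly
the "user context" stored with a BPF token later), and the policy decides
whether a requested operation is permitted. The operation bits, the table
contents, and policy_allows() are hypothetical names made up for this
example. They are not kernel UAPI, and a real BPF LSM program would look
different.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation bits for this sketch; NOT kernel UAPI. */
enum policy_op {
	POLICY_PROG_LOAD   = 1u << 0,
	POLICY_MAP_CREATE  = 1u << 1,
	POLICY_LINK_ATTACH = 1u << 2,
};

/* One policy entry per container, keyed by the u64 assigned to it. */
struct container_policy {
	uint64_t cookie;
	uint32_t allowed_ops;
};

/* Example table; cookie values and permissions are invented. */
static const struct container_policy policies[] = {
	{ .cookie = 0x1001, .allowed_ops = POLICY_PROG_LOAD | POLICY_MAP_CREATE },
	{ .cookie = 0x1002, .allowed_ops = POLICY_LINK_ATTACH },
};

/*
 * Return true only if the cookie is known and every requested
 * operation bit is present in that container's allowed set.
 */
static bool policy_allows(uint64_t cookie, uint32_t ops)
{
	for (size_t i = 0; i < sizeof(policies) / sizeof(policies[0]); i++) {
		if (policies[i].cookie == cookie)
			return (policies[i].allowed_ops & ops) == ops;
	}
	return false; /* unknown cookie: deny by default */
}
```

The deny-by-default branch is a design choice of this sketch: a container
whose identifier is not in the table gets no BPF operations at all.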