mbox series

[RFC,bpf-next,seccomp,00/12] eBPF seccomp filters

Message ID cover.1620499942.git.yifeifz2@illinois.edu (mailing list archive)
Headers show
Series eBPF seccomp filters | expand

Message

YiFei Zhu May 10, 2021, 5:22 p.m. UTC
From: YiFei Zhu <yifeifz2@illinois.edu>

Based on: https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html

This patchset enables seccomp filters to be written in eBPF.
Supporting eBPF filters has been proposed a few times in the past.
The main concerns were (1) use cases and (2) security. We have
identified many use cases that can benefit from advanced eBPF
filters, such as:

  * exec-only-once filter / apply filter after exec
  * syscall logging (eg. via maps)
  * expressiveness & better tooling (no need for DSLs like easyseccomp)
  * contained syscall fault injection
  * Temporal System Call Specialization [1] with restrictive
    initialization phases (serving phase syscalls are filtered)
  * possible future extensions such as syscall serialization and
    argument rewriting

These features can also be achieved by user notifier + ptrace but
unfortunately user notifier is a lot of context switches (see the
benchmark results below), and hence much less efficient than eBPF.

For security, for an unprivileged caller, our implementation is as
restrictive as user notifier + ptrace, in regards to capabilities.
eBPF helpers follow the privilege model of original eBPF helpers.

Advanced eBPF feature (maps & helpers) is restricted by a new LSM
hook seccomp_extended. If LSM permits these features, then all standard
bpf helpers are permitted, and tracing helpers are permitted too if the
loader is bpf_capable and perfmon_capable. Mutable privileges should
not be a concern because if seccomp-eBPF is used to implement a mutable
policy of privileges, such policy can be implemented using user
notifier anyhow (which does not require seccomp-eBPF).

Moreover, a mechanism for reading user memory is added. The same
prototypes of bpf_probe_read_user{,str} from tracing are used. However,
when the loader of bpf program does not have CAP_PTRACE, the helper
will return -EPERM if the task under seccomp filter is non-dumpable.
The reason for this is that if we perform reduction from seccomp-eBPF
to user notifier + ptrace, ptrace requires CAP_PTRACE to read from
a non-dumpable process. However, eBPF does not solve the TOCTOU problem
of user notifier, so users should not use this to enforce a policy
based on memory contents.

In addition, a mechanism for storing process states between filter runs
is added. This uses the BPF-LSM task storage. However, since
unprivileged bpf loaders do not have access to ptr to BTF ID for use as
the task parameter to the helpers, the workaround is to use NULL as the
parameter, and the helper will fallback to current's group leader. This
is insufficient, unfortunately, because of the BTF enforcement in
bpf_local_storage_map_alloc_check, and the fact that tasks without
bpf_capable cannot load map BTFs. (Can I ask why this is restricted
this way?)

Giuseppe Scrivano shows how to support eBPF filters in crun [2], based
on which we have tested a number of stateful filters.

Performance wise, Jinghao did a test of 1,000,000 getpid() calls on an
Intel i7-9700K, with stock Ubuntu config. The syscalls are half EPERM
and half passthrough to the getpid() syscall handler [3]. The tests
are done recording a median of 10:

                user notif      eBPF            ratio
QEMU            6808104 us      80508.5 us      84.6
Bare Metal      3403667.5 us    80316 us        42.4

[1] https://www.usenix.org/conference/usenixsecurity20/presentation/ghavamnia
[2] https://github.com/giuseppe/crun/commit/3906b4fbcb671f8f188deef08c94ceae86a80120
[3] https://github.com/xlab-uiuc/seccomp-ebpf-upstream/tree/perf-test

Patch 1 moves no_new_privs check in filter loading.
Patch 2 implements basic support for seccomp-eBPF in the kernel.
Patch 3 enables a ptracer to get a fd to the eBPF for CRIU.
Patch 4 enables libbpf to recognize the section "seccomp".
Patch 5 adds a sample program test_seccomp to samples/bpf.

Patch 6 adds an LSM hook seccomp_extended.
Patch 7 allows bpf verifier hooks to restrict direct map access.
Patch 8 implements restrictions for eBPF filters depending on LSM hooks.
Patch 9 lets Yama LSM restrict seccomp-ebpf based on ptrace_scope.

Patch 10 enables seccomp-ebpf to read user memory.
Patch 11 allows bpf helpers to have nullable ptr to BTF ID as argument.
Patch 12 implements process storage using BPF-LSM task storage.

Sargun Dhillon (3):
  bpf, seccomp: Add eBPF filter capabilities
  seccomp, ptrace: Add a mechanism to retrieve attached eBPF seccomp
    filters
  samples/bpf: Add eBPF seccomp sample programs

YiFei Zhu (9):
  seccomp: Move no_new_privs check to after prepare_filter
  libbpf: recognize section "seccomp"
  lsm: New hook seccomp_extended
  bpf/verifier: allow restricting direct map access
  seccomp-ebpf: restrict filter to almost cBPF if LSM request such
  yama: (concept) restrict seccomp-eBPF with ptrace_scope
  seccomp-ebpf: Add ability to read user memory
  bpf/verifier: support NULL-able ptr to BTF ID as helper argument
  seccomp-ebpf: support task storage from BPF-LSM, defaulting to group
    leader

 arch/Kconfig                    |   7 +
 include/linux/bpf.h             |   8 ++
 include/linux/bpf_types.h       |   4 +
 include/linux/lsm_hook_defs.h   |   4 +
 include/linux/seccomp.h         |  15 +-
 include/linux/security.h        |  13 ++
 include/uapi/linux/bpf.h        |   1 +
 include/uapi/linux/ptrace.h     |   2 +
 include/uapi/linux/seccomp.h    |   1 +
 kernel/bpf/bpf_task_storage.c   |  64 +++++++--
 kernel/bpf/syscall.c            |   1 +
 kernel/bpf/verifier.c           |  15 +-
 kernel/ptrace.c                 |   4 +
 kernel/seccomp.c                | 235 ++++++++++++++++++++++++++++----
 kernel/trace/bpf_trace.c        |  42 ++++++
 samples/bpf/Makefile            |   3 +
 samples/bpf/test_seccomp_kern.c |  41 ++++++
 samples/bpf/test_seccomp_user.c |  49 +++++++
 security/security.c             |   8 ++
 security/yama/yama_lsm.c        |  30 ++++
 tools/include/uapi/linux/bpf.h  |   1 +
 tools/lib/bpf/libbpf.c          |   1 +
 22 files changed, 511 insertions(+), 38 deletions(-)
 create mode 100644 samples/bpf/test_seccomp_kern.c
 create mode 100644 samples/bpf/test_seccomp_user.c

--
2.31.1

Comments

Andy Lutomirski May 10, 2021, 5:47 p.m. UTC | #1
On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>
> From: YiFei Zhu <yifeifz2@illinois.edu>
>
> Based on: https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html
>
> This patchset enables seccomp filters to be written in eBPF.
> Supporting eBPF filters has been proposed a few times in the past.
> The main concerns were (1) use cases and (2) security. We have
> identified many use cases that can benefit from advanced eBPF
> filters, such as:

I haven't reviewed this carefully, but I think we need to distinguish
a few things:

1. Using the eBPF *language*.

2. Allowing the use of stateful / non-pure eBPF features.

3. Allowing the eBPF programs to read the target process' memory.

I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
even less convinced by (3).

>
>   * exec-only-once filter / apply filter after exec

This is (2).  I'm not sure it's a good idea.

>   * syscall logging (eg. via maps)

This is (2).  Probably useful, but doesn't obviously belong in
seccomp, or at least not as part of the same seccomp feature as
regular filtering.

>   * expressiveness & better tooling (no need for DSLs like easyseccomp)

(1).  Sounds good.

>   * contained syscall fault injection

(2)?  We can already do this with notifiers.

> For security, for an unprivileged caller, our implementation is as
> restrictive as user notifier + ptrace, in regards to capabilities.
> eBPF helpers follow the privilege model of original eBPF helpers.

eBPF doesn't really have a privilege model yet.  There was a long and
disappointing thread about this awhile back.

> Moreover, a mechanism for reading user memory is added. The same
> prototypes of bpf_probe_read_user{,str} from tracing are used. However,
> when the loader of bpf program does not have CAP_PTRACE, the helper
> will return -EPERM if the task under seccomp filter is non-dumpable.
> The reason for this is that if we perform reduction from seccomp-eBPF
> to user notifier + ptrace, ptrace requires CAP_PTRACE to read from
> a non-dumpable process. However, eBPF does not solve the TOCTOU problem
> of user notifier, so users should not use this to enforce a policy
> based on memory contents.

What is this for?
YiFei Zhu May 11, 2021, 5:21 a.m. UTC | #2
On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> >
> > From: YiFei Zhu <yifeifz2@illinois.edu>
> >
> > Based on: https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html
> >
> > This patchset enables seccomp filters to be written in eBPF.
> > Supporting eBPF filters has been proposed a few times in the past.
> > The main concerns were (1) use cases and (2) security. We have
> > identified many use cases that can benefit from advanced eBPF
> > filters, such as:
>
> I haven't reviewed this carefully, but I think we need to distinguish
> a few things:
>
> 1. Using the eBPF *language*.
>
> 2. Allowing the use of stateful / non-pure eBPF features.
>
> 3. Allowing the eBPF programs to read the target process' memory.
>
> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> even less convinced by (3).
>
> >
> >   * exec-only-once filter / apply filter after exec
>
> This is (2).  I'm not sure it's a good idea.

The basic idea is that for a container runtime it may wait to execute
a program in a container without that program being able to execve
another program, stopping any attack that involves loading another
binary. The container runtime can block any syscall but execve in the
exec-ed process by using only cBPF.

The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
@Andrea and @Giuseppe, could you clarify more in case I missed
something?

> >   * syscall logging (eg. via maps)
>
> This is (2).  Probably useful, but doesn't obviously belong in
> seccomp, or at least not as part of the same seccomp feature as
> regular filtering.
>
> >   * expressiveness & better tooling (no need for DSLs like easyseccomp)
>
> (1).  Sounds good.
>
> >   * contained syscall fault injection
>
> (2)?  We can already do this with notifiers.

To clarify, “we can already do with notifiers” isn’t the point here.
We can do almost everything if you have notifiers and ptrace, but it
may impose significant overhead (see the microbenchmark results).

The reason I’m saying the overhead is important is for the
reproduction / testing of certain race conditions. A syscall failing
quickly in a userspace application could, from a race condition, have
a completely different trace as a syscall failing after a few context
switches. eBPF makes quick fault injection possible.

> > For security, for an unprivileged caller, our implementation is as
> > restrictive as user notifier + ptrace, in regards to capabilities.
> > eBPF helpers follow the privilege model of original eBPF helpers.
>
> eBPF doesn't really have a privilege model yet.  There was a long and
> disappointing thread about this awhile back.

The idea is that “seccomp-eBPF does not make life easier for an
adversary”. Any attack an adversary could potentially utilize
seccomp-eBPF, they can do the same with other eBPF features, i.e. it
would be an issue with eBPF in general rather than specifically
seccomp’s use of eBPF.

Here it is referring to the helpers goes to the base
bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
!perfmon_capable). In this case, if the adversary would utilize eBPF
helpers to perform an attack, they could do it via another
unprivileged prog type.

That said, there are a few additional helpers this patchset is adding:
* get_current_uid_gid
* get_current_pid_tgid
  These two provide public information (are namespaces a concern?). I
have no idea what kind of exploit it could add unless the adversary
somehow side-channels the task_struct? But in that case, how is the
reading of task_struct different from how the rest of the kernel is
reading task_struct?
  Though, if knowing the global uid / pid is a concern then the eBPF
progs will need to keep track of namespaces, and that might not be
trivial.
* probe_read_user
* probe_read_user_str
  Reduction to ptrace. The privilege model of reading another
process’s data (via process_vm_readv or
ptrace(PTRACE_PEEK{TEXT,DATA})) is guarded by
PTRACE_MODE_ATTACH_REALCREDS. However, unprivileged seccomp is
safeguarded by no_new_privs, so, unless the caller have a non-uniform
resuid & fsuid, in which case it’s the caller’s failure to relinquish
privileges, ruid of the seccomp-eBPF executor (which is task whose
syscalls is being filtered) would be the save as the ruid of the
applier (the task that set the seccomp mode, at the time of setting
it).
  The main concern here is LSMs. LSMs can further restrict the scope
of ptrace hence I also allow LSMs to deny all “the use of stateful /
non-pure eBPF features”.
  As for side channels... the copy_from_user_nofault may allow an
adversary to observe what’s in resident memory and what’s swapped out,
but the adversary can already do this by observing the timing of
memory accesses. The non-nofault variant copy_from_user is used
everywhere in the kernel, so if an adversary were to side channel the
kernel by copy_from_user against an address, they can already do it by
using a syscall with a pointer that would be used by copy_from_user.
* task_storage_get
* task_storage_delete
  This is what I’m least sure about. The implementation of
task_storage is more complex than the other helpers, and also assumes
a privileged eBPF loader. It would slightly extend the attack surface.
If this is a big issue then eBPF can emulate such a map by using some
hashmap and having PID as key...

> > Moreover, a mechanism for reading user memory is added. The same
> > prototypes of bpf_probe_read_user{,str} from tracing are used. However,
> > when the loader of bpf program does not have CAP_PTRACE, the helper
> > will return -EPERM if the task under seccomp filter is non-dumpable.
> > The reason for this is that if we perform reduction from seccomp-eBPF
> > to user notifier + ptrace, ptrace requires CAP_PTRACE to read from
> > a non-dumpable process. However, eBPF does not solve the TOCTOU problem
> > of user notifier, so users should not use this to enforce a policy
> > based on memory contents.
>
> What is this for?

Memory reading opens up lots of use cases. For example, logging what
files are being opened without imposing too much performance penalty
from strace. Or as an accelerator for user notify emulation, where
syscalls can be rejected on a fast path if we know the memory contents
does not satisfy certain conditions that user notify will check.

YiFei Zhu
Andy Lutomirski May 15, 2021, 3:49 p.m. UTC | #3
On 5/10/21 10:21 PM, YiFei Zhu wrote:
> On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
>> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>>>
>>> From: YiFei Zhu <yifeifz2@illinois.edu>
>>>
>>> Based on: https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html
>>>
>>> This patchset enables seccomp filters to be written in eBPF.
>>> Supporting eBPF filters has been proposed a few times in the past.
>>> The main concerns were (1) use cases and (2) security. We have
>>> identified many use cases that can benefit from advanced eBPF
>>> filters, such as:
>>
>> I haven't reviewed this carefully, but I think we need to distinguish
>> a few things:
>>
>> 1. Using the eBPF *language*.
>>
>> 2. Allowing the use of stateful / non-pure eBPF features.
>>
>> 3. Allowing the eBPF programs to read the target process' memory.
>>
>> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
>> even less convinced by (3).
>>
>>>
>>>   * exec-only-once filter / apply filter after exec
>>
>> This is (2).  I'm not sure it's a good idea.
> 
> The basic idea is that for a container runtime it may wait to execute
> a program in a container without that program being able to execve
> another program, stopping any attack that involves loading another
> binary. The container runtime can block any syscall but execve in the
> exec-ed process by using only cBPF.
> 
> The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> @Andrea and @Giuseppe, could you clarify more in case I missed
> something?

We've discussed having a notifier-using filter be able to replace its
filter.  This would allow this and other use cases without any
additional eBPF or cBPF code.

>> eBPF doesn't really have a privilege model yet.  There was a long and
>> disappointing thread about this awhile back.
> 
> The idea is that “seccomp-eBPF does not make life easier for an
> adversary”. Any attack an adversary could potentially utilize
> seccomp-eBPF, they can do the same with other eBPF features, i.e. it
> would be an issue with eBPF in general rather than specifically
> seccomp’s use of eBPF.
> 
> Here it is referring to the helpers goes to the base
> bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
> !perfmon_capable). In this case, if the adversary would utilize eBPF
> helpers to perform an attack, they could do it via another
> unprivileged prog type.
> 
> That said, there are a few additional helpers this patchset is adding:
> * get_current_uid_gid
> * get_current_pid_tgid
>   These two provide public information (are namespaces a concern?). I
> have no idea what kind of exploit it could add unless the adversary
> somehow side-channels the task_struct? But in that case, how is the
> reading of task_struct different from how the rest of the kernel is
> reading task_struct?

Yes, namespaces are a concern.  This idea got mostly shot down for kdbus
(what ever happened to that?), and it likely has the same problems for
seccomp.

>>
>> What is this for?
> 
> Memory reading opens up lots of use cases. For example, logging what
> files are being opened without imposing too much performance penalty
> from strace. Or as an accelerator for user notify emulation, where
> syscalls can be rejected on a fast path if we know the memory contents
> does not satisfy certain conditions that user notify will check.
> 

This has all kinds of race conditions.


I hate to be a party pooper, but this patchset is going to very high bar
to acceptance.  Right now, seccomp has a couple of excellent properties:

First, while it has limited expressiveness, it is simple enough that the
implementation can be easily understood and the scope for
vulnerabilities that fall through the cracks of the seccomp sandbox
model is low.  Compare this to Windows' low-integrity/high-integrity
sandbox system: there is a never ending string of sandbox escapes due to
token misuse, unexpected things at various integrity levels, etc.
Seccomp doesn't have tokens or integrity levels, and these bugs don't
happen.

Second, seccomp works, almost unchanged, in a completely unprivileged
context.  The last time making eBPF work sensibly in a less- or
-unprivileged context, the maintainers mostly rejected the idea of
developing/debugging a permission model for maps, cleaning up the bpf
object id system, etc.  You are going to have a very hard time
convincing the seccomp maintainers to let any of these mechanism
interact with seccomp until the underlying permission model is in place.

--Andy
Tianyin Xu May 16, 2021, 8:38 a.m. UTC | #4
On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> >>>
> >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> >>>
> >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> >>>
> >>> This patchset enables seccomp filters to be written in eBPF.
> >>> Supporting eBPF filters has been proposed a few times in the past.
> >>> The main concerns were (1) use cases and (2) security. We have
> >>> identified many use cases that can benefit from advanced eBPF
> >>> filters, such as:
> >>
> >> I haven't reviewed this carefully, but I think we need to distinguish
> >> a few things:
> >>
> >> 1. Using the eBPF *language*.
> >>
> >> 2. Allowing the use of stateful / non-pure eBPF features.
> >>
> >> 3. Allowing the eBPF programs to read the target process' memory.
> >>
> >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> >> even less convinced by (3).
> >>
> >>>
> >>>   * exec-only-once filter / apply filter after exec
> >>
> >> This is (2).  I'm not sure it's a good idea.
> >
> > The basic idea is that for a container runtime it may wait to execute
> > a program in a container without that program being able to execve
> > another program, stopping any attack that involves loading another
> > binary. The container runtime can block any syscall but execve in the
> > exec-ed process by using only cBPF.
> >
> > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > @Andrea and @Giuseppe, could you clarify more in case I missed
> > something?
>
> We've discussed having a notifier-using filter be able to replace its
> filter.  This would allow this and other use cases without any
> additional eBPF or cBPF code.
>

A notifier is not always a solution (even ignoring its perf overhead).

One problem, pointed out by Andrea Arcangeli, is that notifiers need
userspace daemons. So, it can hardly be used by daemonless container
engines like Podman.

And, /* sorry for repeating.. */ the performance overhead of notifiers
is not close to ebpf, which prevents use cases that require native
performance.


> >> eBPF doesn't really have a privilege model yet.  There was a long and
> >> disappointing thread about this awhile back.
> >
> > The idea is that “seccomp-eBPF does not make life easier for an
> > adversary”. Any attack an adversary could potentially utilize
> > seccomp-eBPF, they can do the same with other eBPF features, i.e. it
> > would be an issue with eBPF in general rather than specifically
> > seccomp’s use of eBPF.
> >
> > Here it is referring to the helpers goes to the base
> > bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
> > !perfmon_capable). In this case, if the adversary would utilize eBPF
> > helpers to perform an attack, they could do it via another
> > unprivileged prog type.
> >
> > That said, there are a few additional helpers this patchset is adding:
> > * get_current_uid_gid
> > * get_current_pid_tgid
> >   These two provide public information (are namespaces a concern?). I
> > have no idea what kind of exploit it could add unless the adversary
> > somehow side-channels the task_struct? But in that case, how is the
> > reading of task_struct different from how the rest of the kernel is
> > reading task_struct?
>
> Yes, namespaces are a concern.  This idea got mostly shot down for kdbus
> (what ever happened to that?), and it likely has the same problems for
> seccomp.
>
> >>
> >> What is this for?
> >
> > Memory reading opens up lots of use cases. For example, logging what
> > files are being opened without imposing too much performance penalty
> > from strace. Or as an accelerator for user notify emulation, where
> > syscalls can be rejected on a fast path if we know the memory contents
> > does not satisfy certain conditions that user notify will check.
> >
>
> This has all kinds of race conditions.
>
>
> I hate to be a party pooper, but this patchset is going to very high bar
> to acceptance.  Right now, seccomp has a couple of excellent properties:
>
> First, while it has limited expressiveness, it is simple enough that the
> implementation can be easily understood and the scope for
> vulnerabilities that fall through the cracks of the seccomp sandbox
> model is low.  Compare this to Windows' low-integrity/high-integrity
> sandbox system: there is a never ending string of sandbox escapes due to
> token misuse, unexpected things at various integrity levels, etc.
> Seccomp doesn't have tokens or integrity levels, and these bugs don't
> happen.
>
> Second, seccomp works, almost unchanged, in a completely unprivileged
> context.  The last time making eBPF work sensibly in a less- or
> -unprivileged context, the maintainers mostly rejected the idea of
> developing/debugging a permission model for maps, cleaning up the bpf
> object id system, etc.  You are going to have a very hard time
> convincing the seccomp maintainers to let any of these mechanism
> interact with seccomp until the underlying permission model is in place.
>
> --Andy

Thanks for pointing out the tradeoff between expressiveness vs. simplicity.

Note that we are _not_ proposing to replace cbpf, but propose to also
support ebpf filters. There certainly are use cases where cbpf is
sufficient, but there are also important use cases ebpf could make
life much easier.

Most importantly, we strongly believe that ebpf filters can be
supported without reducing security.

No worries about “party pooping” and we appreciate the feedback. We’d
love to hear concerns and collect feedback so we can address them to
hit that very high bar.


~t

--
Tianyin Xu
University of Illinois at Urbana-Champaign
https://tianyin.github.io/
Tycho Andersen May 17, 2021, 3:40 p.m. UTC | #5
On Sun, May 16, 2021 at 03:38:00AM -0500, Tianyin Xu wrote:
> On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> > >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > >>>
> > >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> > >>>
> > >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> > >>>
> > >>> This patchset enables seccomp filters to be written in eBPF.
> > >>> Supporting eBPF filters has been proposed a few times in the past.
> > >>> The main concerns were (1) use cases and (2) security. We have
> > >>> identified many use cases that can benefit from advanced eBPF
> > >>> filters, such as:
> > >>
> > >> I haven't reviewed this carefully, but I think we need to distinguish
> > >> a few things:
> > >>
> > >> 1. Using the eBPF *language*.
> > >>
> > >> 2. Allowing the use of stateful / non-pure eBPF features.
> > >>
> > >> 3. Allowing the eBPF programs to read the target process' memory.
> > >>
> > >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> > >> even less convinced by (3).
> > >>
> > >>>
> > >>>   * exec-only-once filter / apply filter after exec
> > >>
> > >> This is (2).  I'm not sure it's a good idea.
> > >
> > > The basic idea is that for a container runtime it may wait to execute
> > > a program in a container without that program being able to execve
> > > another program, stopping any attack that involves loading another
> > > binary. The container runtime can block any syscall but execve in the
> > > exec-ed process by using only cBPF.
> > >
> > > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > > @Andrea and @Giuseppe, could you clarify more in case I missed
> > > something?
> >
> > We've discussed having a notifier-using filter be able to replace its
> > filter.  This would allow this and other use cases without any
> > additional eBPF or cBPF code.
> >
> 
> A notifier is not always a solution (even ignoring its perf overhead).
> 
> One problem, pointed out by Andrea Arcangeli, is that notifiers need
> userspace daemons. So, it can hardly be used by daemonless container
> engines like Podman.

I'm not sure I buy this argument. Podman already has a conmon instance
for each container, this could be a child of that conmon process, or
live inside conmon itself.

Tycho
Sargun Dhillon May 17, 2021, 5:07 p.m. UTC | #6
On Sun, May 16, 2021 at 1:39 AM Tianyin Xu <tyxu@illinois.edu> wrote:
>
> On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> > >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > >>>
> > >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> > >>>
> > >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> > >>>
> > >>> This patchset enables seccomp filters to be written in eBPF.
> > >>> Supporting eBPF filters has been proposed a few times in the past.
> > >>> The main concerns were (1) use cases and (2) security. We have
> > >>> identified many use cases that can benefit from advanced eBPF
> > >>> filters, such as:
> > >>
> > >> I haven't reviewed this carefully, but I think we need to distinguish
> > >> a few things:
> > >>
> > >> 1. Using the eBPF *language*.
> > >>
> > >> 2. Allowing the use of stateful / non-pure eBPF features.
> > >>
> > >> 3. Allowing the eBPF programs to read the target process' memory.
> > >>
> > >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> > >> even less convinced by (3).
> > >>
> > >>>
> > >>>   * exec-only-once filter / apply filter after exec
> > >>
> > >> This is (2).  I'm not sure it's a good idea.
> > >
> > > The basic idea is that for a container runtime it may wait to execute
> > > a program in a container without that program being able to execve
> > > another program, stopping any attack that involves loading another
> > > binary. The container runtime can block any syscall but execve in the
> > > exec-ed process by using only cBPF.
> > >
> > > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > > @Andrea and @Giuseppe, could you clarify more in case I missed
> > > something?
> >
> > We've discussed having a notifier-using filter be able to replace its
> > filter.  This would allow this and other use cases without any
> > additional eBPF or cBPF code.
> >
>
> A notifier is not always a solution (even ignoring its perf overhead).
>
> One problem, pointed out by Andrea Arcangeli, is that notifiers need
> userspace daemons. So, it can hardly be used by daemonless container
> engines like Podman.
>
> And, /* sorry for repeating.. */ the performance overhead of notifiers
> is not close to ebpf, which prevents use cases that require native
> performance.

While I agree with you that this is the case right now, there's no reason it
has to be the case. There's a variety of mechanisms that can be employed
to significantly speed up the performance of the notifier. For example, right
now the notifier is behind one large per-filter lock. That could be removed
allowing for better concurrency. There are a large number of mechanisms
that scale O(n) with the outstanding notifications -- again, something
that could be improved.

The other big improvement that could be made is being able to use something
like io_uring with the notifier interface, but it would require a
fairly significant
user API change -- and a move away from ioctl. I'm not sure if people are
excited about that idea at the moment.

>
>
> > >> eBPF doesn't really have a privilege model yet.  There was a long and
> > >> disappointing thread about this awhile back.
> > >
> > > The idea is that “seccomp-eBPF does not make life easier for an
> > > adversary”. Any attack an adversary could potentially utilize
> > > seccomp-eBPF, they can do the same with other eBPF features, i.e. it
> > > would be an issue with eBPF in general rather than specifically
> > > seccomp’s use of eBPF.
> > >
> > > Here it is referring to the helpers goes to the base
> > > bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
> > > !perfmon_capable). In this case, if the adversary would utilize eBPF
> > > helpers to perform an attack, they could do it via another
> > > unprivileged prog type.
> > >
> > > That said, there are a few additional helpers this patchset is adding:
> > > * get_current_uid_gid
> > > * get_current_pid_tgid
> > >   These two provide public information (are namespaces a concern?). I
> > > have no idea what kind of exploit it could add unless the adversary
> > > somehow side-channels the task_struct? But in that case, how is the
> > > reading of task_struct different from how the rest of the kernel is
> > > reading task_struct?
> >
> > Yes, namespaces are a concern.  This idea got mostly shot down for kdbus
> > (what ever happened to that?), and it likely has the same problems for
> > seccomp.
> >
> > >>
> > >> What is this for?
> > >
> > > Memory reading opens up lots of use cases. For example, logging what
> > > files are being opened without imposing too much performance penalty
> > > from strace. Or as an accelerator for user notify emulation, where
> > > syscalls can be rejected on a fast path if we know the memory contents
> > > does not satisfy certain conditions that user notify will check.
> > >
> >
> > This has all kinds of race conditions.
> >
> >
> > I hate to be a party pooper, but this patchset is going to very high bar
> > to acceptance.  Right now, seccomp has a couple of excellent properties:
> >
> > First, while it has limited expressiveness, it is simple enough that the
> > implementation can be easily understood and the scope for
> > vulnerabilities that fall through the cracks of the seccomp sandbox
> > model is low.  Compare this to Windows' low-integrity/high-integrity
> > sandbox system: there is a never ending string of sandbox escapes due to
> > token misuse, unexpected things at various integrity levels, etc.
> > Seccomp doesn't have tokens or integrity levels, and these bugs don't
> > happen.
> >
> > Second, seccomp works, almost unchanged, in a completely unprivileged
> > context.  The last time making eBPF work sensibly in a less- or
> > -unprivileged context, the maintainers mostly rejected the idea of
> > developing/debugging a permission model for maps, cleaning up the bpf
> > object id system, etc.  You are going to have a very hard time
> > convincing the seccomp maintainers to let any of these mechanism
> > interact with seccomp until the underlying permission model is in place.
> >
> > --Andy
>
> Thanks for pointing out the tradeoff between expressiveness vs. simplicity.
>
> Note that we are _not_ proposing to replace cbpf, but propose to also
> support ebpf filters. There certainly are use cases where cbpf is
> sufficient, but there are also important use cases ebpf could make
> life much easier.
>
> Most importantly, we strongly believe that ebpf filters can be
> supported without reducing security.
>
> No worries about “party pooping” and we appreciate the feedback. We’d
> love to hear concerns and collect feedback so we can address them to
> hit that very high bar.
>
>
> ~t
>
> --
> Tianyin Xu
> University of Illinois at Urbana-Champaign
> https://tianyin.github.io/
Tianyin Xu May 20, 2021, 8:16 a.m. UTC | #7
On Mon, May 17, 2021 at 10:40 AM Tycho Andersen <tycho@tycho.pizza> wrote:
>
> On Sun, May 16, 2021 at 03:38:00AM -0500, Tianyin Xu wrote:
> > On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
> > >
> > > On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > > > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > > >>>
> > > >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> > > >>>
> > > >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> > > >>>
> > > >>> This patchset enables seccomp filters to be written in eBPF.
> > > >>> Supporting eBPF filters has been proposed a few times in the past.
> > > >>> The main concerns were (1) use cases and (2) security. We have
> > > >>> identified many use cases that can benefit from advanced eBPF
> > > >>> filters, such as:
> > > >>
> > > >> I haven't reviewed this carefully, but I think we need to distinguish
> > > >> a few things:
> > > >>
> > > >> 1. Using the eBPF *language*.
> > > >>
> > > >> 2. Allowing the use of stateful / non-pure eBPF features.
> > > >>
> > > >> 3. Allowing the eBPF programs to read the target process' memory.
> > > >>
> > > >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> > > >> even less convinced by (3).
> > > >>
> > > >>>
> > > >>>   * exec-only-once filter / apply filter after exec
> > > >>
> > > >> This is (2).  I'm not sure it's a good idea.
> > > >
> > > > The basic idea is that for a container runtime it may wait to execute
> > > > a program in a container without that program being able to execve
> > > > another program, stopping any attack that involves loading another
> > > > binary. The container runtime can block any syscall but execve in the
> > > > exec-ed process by using only cBPF.
> > > >
> > > > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > > > @Andrea and @Giuseppe, could you clarify more in case I missed
> > > > something?
> > >
> > > We've discussed having a notifier-using filter be able to replace its
> > > filter.  This would allow this and other use cases without any
> > > additional eBPF or cBPF code.
> > >
> >
> > A notifier is not always a solution (even ignoring its perf overhead).
> >
> > One problem, pointed out by Andrea Arcangeli, is that notifiers need
> > userspace daemons. So, it can hardly be used by daemonless container
> > engines like Podman.
>
> I'm not sure I buy this argument. Podman already has a conmon instance
> for each container, this could be a child of that conmon process, or
> live inside conmon itself.
>
> Tycho

I checked with Andrea Arcangeli and Giuseppe Scrivano who are working on Podman.

You are right that Podman is not completely daemonless. However, “the
fact it's no entirely daemonless doesn't imply it's a good idea to
make it worse and to add complexity to the background conmon daemon or
to add more daemons.”

TL;DR. User notifiers are surely more flexible, but are also more
expensive and complex to implement, compared with ebpf filters. /*
I’ll reply to Sargun’s performance argument in a separate email */

I'm sure you know Podman well, but let me still move some jade from
Andrea and Giuseppe (all credits on podmon/crun are theirs) to
elaborate the point, for folks cced on the list who are not very
familiar with Podman.

Basically, the current order goes as follows:

         podman -> conmon -> crun -> container_binary
                               \
                                - seccomp done at crun level, not conmon

At runtime, what's left is:

         conmon -> container_binary  /* podman disappears; crun disappears */

So, to go through and use seccomp notify to block `exec`, we can
either start the container_binary with a seccomp agent wrapper, or
bloat the common binary (as pointed out by Tycho).

If we go with the first approach, we will have:

         podman -> conmon -> crun -> seccomp_agent -> container_binary

So, at runtime we'd be left with one more daemon:

        conmon -> seccomp_agent -> container_binary

Apparently, nobody likes one more daemon. So, the proposal from
Giuseppe was/is to use user notifiers as plugins (.so) loaded by
conmon:
https://github.com/containers/conmon/pull/190
https://github.com/containers/crun/pull/438

Now, with the ebpf filter support, one can implement the same thing
using an embarrassingly simple ebpf filter and, thanks to Giuseppe,
this is well supported by crun.
Tianyin Xu May 20, 2021, 8:22 a.m. UTC | #8
On Mon, May 17, 2021 at 12:08 PM Sargun Dhillon <sargun@sargun.me> wrote:
>
> On Sun, May 16, 2021 at 1:39 AM Tianyin Xu <tyxu@illinois.edu> wrote:
> >
> > On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
> > >
> > > On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > > > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > > >>>
> > > >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> > > >>>
> > > >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> > > >>>
> > > >>> This patchset enables seccomp filters to be written in eBPF.
> > > >>> Supporting eBPF filters has been proposed a few times in the past.
> > > >>> The main concerns were (1) use cases and (2) security. We have
> > > >>> identified many use cases that can benefit from advanced eBPF
> > > >>> filters, such as:
> > > >>
> > > >> I haven't reviewed this carefully, but I think we need to distinguish
> > > >> a few things:
> > > >>
> > > >> 1. Using the eBPF *language*.
> > > >>
> > > >> 2. Allowing the use of stateful / non-pure eBPF features.
> > > >>
> > > >> 3. Allowing the eBPF programs to read the target process' memory.
> > > >>
> > > >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> > > >> even less convinced by (3).
> > > >>
> > > >>>
> > > >>>   * exec-only-once filter / apply filter after exec
> > > >>
> > > >> This is (2).  I'm not sure it's a good idea.
> > > >
> > > > The basic idea is that for a container runtime it may wait to execute
> > > > a program in a container without that program being able to execve
> > > > another program, stopping any attack that involves loading another
> > > > binary. The container runtime can block any syscall but execve in the
> > > > exec-ed process by using only cBPF.
> > > >
> > > > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > > > @Andrea and @Giuseppe, could you clarify more in case I missed
> > > > something?
> > >
> > > We've discussed having a notifier-using filter be able to replace its
> > > filter.  This would allow this and other use cases without any
> > > additional eBPF or cBPF code.
> > >
> >
> > A notifier is not always a solution (even ignoring its perf overhead).
> >
> > One problem, pointed out by Andrea Arcangeli, is that notifiers need
> > userspace daemons. So, it can hardly be used by daemonless container
> > engines like Podman.
> >
> > And, /* sorry for repeating.. */ the performance overhead of notifiers
> > is not close to ebpf, which prevents use cases that require native
> > performance.
>
> While I agree with you that this is the case right now, there's no reason it
> has to be the case. There's a variety of mechanisms that can be employed
> to significantly speed up the performance of the notifier. For example, right
> now the notifier is behind one large per-filter lock. That could be removed
> allowing for better concurrency. There are a large number of mechanisms
> that scale O(n) with the outstanding notifications -- again, something
> that could be improved.

Thanks for the pointer! But, I don’t think this can fundamentally
eliminate the performance gap between the notifiers and the ebpf
filters. IMHO, the additional context switches of user notifiers make
the difference.

>
> The other big improvement that could be made is being able to use something
> like io_uring with the notifier interface, but it would require a
> fairly significant
> user API change -- and a move away from ioctl. I'm not sure if people are
> excited about that idea at the moment.
>

Apologize that I don’t fully understand your proposal. My
understanding about io_uring is that it allows you to amortize the
cost of context switch but not eliminate it, unless you are willing to
dedicate a core for it. I still believe that, even with io_uring, user
notifiers are going to be much slower than eBPF filters.

Btw, our patches are based on your patch set (thank you!). Are you
using user notifiers (with your improved version?) these days? It will
be nice to hear your opinions on ebpf filters.

> >
> >
> > > >> eBPF doesn't really have a privilege model yet.  There was a long and
> > > >> disappointing thread about this awhile back.
> > > >
> > > > The idea is that “seccomp-eBPF does not make life easier for an
> > > > adversary”. Any attack an adversary could potentially utilize
> > > > seccomp-eBPF, they can do the same with other eBPF features, i.e. it
> > > > would be an issue with eBPF in general rather than specifically
> > > > seccomp’s use of eBPF.
> > > >
> > > > Here it is referring to the helpers goes to the base
> > > > bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
> > > > !perfmon_capable). In this case, if the adversary would utilize eBPF
> > > > helpers to perform an attack, they could do it via another
> > > > unprivileged prog type.
> > > >
> > > > That said, there are a few additional helpers this patchset is adding:
> > > > * get_current_uid_gid
> > > > * get_current_pid_tgid
> > > >   These two provide public information (are namespaces a concern?). I
> > > > have no idea what kind of exploit it could add unless the adversary
> > > > somehow side-channels the task_struct? But in that case, how is the
> > > > reading of task_struct different from how the rest of the kernel is
> > > > reading task_struct?
> > >
> > > Yes, namespaces are a concern.  This idea got mostly shot down for kdbus
> > > (what ever happened to that?), and it likely has the same problems for
> > > seccomp.
> > >
> > > >>
> > > >> What is this for?
> > > >
> > > > Memory reading opens up lots of use cases. For example, logging what
> > > > files are being opened without imposing too much performance penalty
> > > > from strace. Or as an accelerator for user notify emulation, where
> > > > syscalls can be rejected on a fast path if we know the memory contents
> > > > does not satisfy certain conditions that user notify will check.
> > > >
> > >
> > > This has all kinds of race conditions.
> > >
> > >
> > > I hate to be a party pooper, but this patchset is going to very high bar
> > > to acceptance.  Right now, seccomp has a couple of excellent properties:
> > >
> > > First, while it has limited expressiveness, it is simple enough that the
> > > implementation can be easily understood and the scope for
> > > vulnerabilities that fall through the cracks of the seccomp sandbox
> > > model is low.  Compare this to Windows' low-integrity/high-integrity
> > > sandbox system: there is a never ending string of sandbox escapes due to
> > > token misuse, unexpected things at various integrity levels, etc.
> > > Seccomp doesn't have tokens or integrity levels, and these bugs don't
> > > happen.
> > >
> > > Second, seccomp works, almost unchanged, in a completely unprivileged
> > > context.  The last time making eBPF work sensibly in a less- or
> > > -unprivileged context, the maintainers mostly rejected the idea of
> > > developing/debugging a permission model for maps, cleaning up the bpf
> > > object id system, etc.  You are going to have a very hard time
> > > convincing the seccomp maintainers to let any of these mechanism
> > > interact with seccomp until the underlying permission model is in place.
> > >
> > > --Andy
> >
> > Thanks for pointing out the tradeoff between expressiveness vs. simplicity.
> >
> > Note that we are _not_ proposing to replace cbpf, but propose to also
> > support ebpf filters. There certainly are use cases where cbpf is
> > sufficient, but there are also important use cases ebpf could make
> > life much easier.
> >
> > Most importantly, we strongly believe that ebpf filters can be
> > supported without reducing security.
> >
> > No worries about “party pooping” and we appreciate the feedback. We’d
> > love to hear concerns and collect feedback so we can address them to
> > hit that very high bar.
> >
> >
> > ~t
> >
> > --
> > Tianyin Xu
> > University of Illinois at Urbana-Champaign
> > https://urldefense.com/v3/__https://tianyin.github.io/__;!!DZ3fjg!o4__Ob32oapUDg9_f6hzksoFiX9517CJ5-w8qtG9i-WKFs_xWbGQfUHpLjHjCddw$
Christian Brauner May 20, 2021, 8:56 a.m. UTC | #9
On Thu, May 20, 2021 at 03:16:10AM -0500, Tianyin Xu wrote:
> On Mon, May 17, 2021 at 10:40 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> >
> > On Sun, May 16, 2021 at 03:38:00AM -0500, Tianyin Xu wrote:
> > > On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
> > > >
> > > > On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > > > > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > > >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > > > >>>
> > > > >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> > > > >>>
> > > > >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> > > > >>>
> > > > >>> This patchset enables seccomp filters to be written in eBPF.
> > > > >>> Supporting eBPF filters has been proposed a few times in the past.
> > > > >>> The main concerns were (1) use cases and (2) security. We have
> > > > >>> identified many use cases that can benefit from advanced eBPF
> > > > >>> filters, such as:
> > > > >>
> > > > >> I haven't reviewed this carefully, but I think we need to distinguish
> > > > >> a few things:
> > > > >>
> > > > >> 1. Using the eBPF *language*.
> > > > >>
> > > > >> 2. Allowing the use of stateful / non-pure eBPF features.
> > > > >>
> > > > >> 3. Allowing the eBPF programs to read the target process' memory.
> > > > >>
> > > > >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> > > > >> even less convinced by (3).
> > > > >>
> > > > >>>
> > > > >>>   * exec-only-once filter / apply filter after exec
> > > > >>
> > > > >> This is (2).  I'm not sure it's a good idea.
> > > > >
> > > > > The basic idea is that for a container runtime it may wait to execute
> > > > > a program in a container without that program being able to execve
> > > > > another program, stopping any attack that involves loading another
> > > > > binary. The container runtime can block any syscall but execve in the
> > > > > exec-ed process by using only cBPF.
> > > > >
> > > > > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > > > > @Andrea and @Giuseppe, could you clarify more in case I missed
> > > > > something?
> > > >
> > > > We've discussed having a notifier-using filter be able to replace its
> > > > filter.  This would allow this and other use cases without any
> > > > additional eBPF or cBPF code.
> > > >
> > >
> > > A notifier is not always a solution (even ignoring its perf overhead).
> > >
> > > One problem, pointed out by Andrea Arcangeli, is that notifiers need
> > > userspace daemons. So, it can hardly be used by daemonless container
> > > engines like Podman.
> >
> > I'm not sure I buy this argument. Podman already has a conmon instance
> > for each container, this could be a child of that conmon process, or
> > live inside conmon itself.
> >
> > Tycho
> 
> I checked with Andrea Arcangeli and Giuseppe Scrivano who are working on Podman.
> 
> You are right that Podman is not completely daemonless. However, “the
> fact it's no entirely daemonless doesn't imply it's a good idea to
> make it worse and to add complexity to the background conmon daemon or
> to add more daemons.”
> 
> TL;DR. User notifiers are surely more flexible, but are also more
> expensive and complex to implement, compared with ebpf filters. /*
> I’ll reply to Sargun’s performance argument in a separate email */
> 
> I'm sure you know Podman well, but let me still move some jade from
> Andrea and Giuseppe (all credits on podmon/crun are theirs) to
> elaborate the point, for folks cced on the list who are not very
> familiar with Podman.
> 
> Basically, the current order goes as follows:
> 
>          podman -> conmon -> crun -> container_binary
>                                \
>                                 - seccomp done at crun level, not conmon
> 
> At runtime, what's left is:
> 
>          conmon -> container_binary  /* podman disappears; crun disappears */
> 
> So, to go through and use seccomp notify to block `exec`, we can
> either start the container_binary with a seccomp agent wrapper, or
> bloat the common binary (as pointed out by Tycho).
> 
> If we go with the first approach, we will have:
> 
>          podman -> conmon -> crun -> seccomp_agent -> container_binary
> 
> So, at runtime we'd be left with one more daemon:
> 
>         conmon -> seccomp_agent -> container_binary

That seems like a strawman. I don't see why this has to be out of
process or a separate daemon. Conmon uses a regular event loop. Adding
support for processing notifier syscall notifications is
straightforward. Moving it to a plugin as you mentioned below is a
design decision not a necessity.

> 
> Apparently, nobody likes one more daemon. So, the proposal from

I'm not sure such a blanket statements about an indeterminate group of
people's alleged preferences constitutes a technical argument wny we
need ebpf in seccomp.

> Giuseppe was/is to use user notifiers as plugins (.so) loaded by
> conmon:
> https://github.com/containers/conmon/pull/190
> https://github.com/containers/crun/pull/438
> 
> Now, with the ebpf filter support, one can implement the same thing
> using an embarrassingly simple ebpf filter and, thanks to Giuseppe,
> this is well supported by crun.

So I think this is trying to jump the gun by saying "Look, the result
might be simpler.". That may even be the case - though I'm not yet
convinced - but Andy's point stands that this brings a slew of issues on
the table that need clear answers. Bringing stateful ebpf features into
seccomp is a pretty big step and especially around the
privilege/security model it looks pretty handwavy right now.

Christian
Christian Brauner May 20, 2021, 9:05 a.m. UTC | #10
On Sat, May 15, 2021 at 08:49:01AM -0700, Andy Lutomirski wrote:
> On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> >>>
> >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> >>>
> >>> Based on: https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html
> >>>
> >>> This patchset enables seccomp filters to be written in eBPF.
> >>> Supporting eBPF filters has been proposed a few times in the past.
> >>> The main concerns were (1) use cases and (2) security. We have
> >>> identified many use cases that can benefit from advanced eBPF
> >>> filters, such as:
> >>
> >> I haven't reviewed this carefully, but I think we need to distinguish
> >> a few things:
> >>
> >> 1. Using the eBPF *language*.
> >>
> >> 2. Allowing the use of stateful / non-pure eBPF features.
> >>
> >> 3. Allowing the eBPF programs to read the target process' memory.
> >>
> >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> >> even less convinced by (3).
> >>
> >>>
> >>>   * exec-only-once filter / apply filter after exec
> >>
> >> This is (2).  I'm not sure it's a good idea.
> > 
> > The basic idea is that for a container runtime it may wait to execute
> > a program in a container without that program being able to execve
> > another program, stopping any attack that involves loading another
> > binary. The container runtime can block any syscall but execve in the
> > exec-ed process by using only cBPF.
> > 
> > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > @Andrea and @Giuseppe, could you clarify more in case I missed
> > something?
> 
> We've discussed having a notifier-using filter be able to replace its
> filter.  This would allow this and other use cases without any
> additional eBPF or cBPF code.

Are you referring to sm like I sketched in
https://lore.kernel.org/containers/20210301110907.2qoxmiy55gpkgwnq@wittgenstein/
?

> 
> >> eBPF doesn't really have a privilege model yet.  There was a long and
> >> disappointing thread about this awhile back.
> > 
> > The idea is that “seccomp-eBPF does not make life easier for an
> > adversary”. Any attack an adversary could potentially utilize
> > seccomp-eBPF, they can do the same with other eBPF features, i.e. it
> > would be an issue with eBPF in general rather than specifically
> > seccomp’s use of eBPF.
> > 
> > Here it is referring to the helpers goes to the base
> > bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
> > !perfmon_capable). In this case, if the adversary would utilize eBPF
> > helpers to perform an attack, they could do it via another
> > unprivileged prog type.
> > 
> > That said, there are a few additional helpers this patchset is adding:
> > * get_current_uid_gid
> > * get_current_pid_tgid
> >   These two provide public information (are namespaces a concern?). I

If they are seen from userspace in any way then these must be resolved
relative to the caller's userns or caller's pidns. So yes, namespaces
need to be taken into account.

> > have no idea what kind of exploit it could add unless the adversary
> > somehow side-channels the task_struct? But in that case, how is the
> > reading of task_struct different from how the rest of the kernel is
> > reading task_struct?
> 
> Yes, namespaces are a concern.  This idea got mostly shot down for kdbus
> (what ever happened to that?), and it likely has the same problems for
> seccomp.
> 
> >>
> >> What is this for?
> > 
> > Memory reading opens up lots of use cases. For example, logging what
> > files are being opened without imposing too much performance penalty
> > from strace. Or as an accelerator for user notify emulation, where
> > syscalls can be rejected on a fast path if we know the memory contents
> > does not satisfy certain conditions that user notify will check.
> > 
> 
> This has all kinds of race conditions.
> 
> 
> I hate to be a party pooper, but this patchset is going to very high bar
> to acceptance.  Right now, seccomp has a couple of excellent properties:
> 
> First, while it has limited expressiveness, it is simple enough that the
> implementation can be easily understood and the scope for
> vulnerabilities that fall through the cracks of the seccomp sandbox
> model is low.  Compare this to Windows' low-integrity/high-integrity
> sandbox system: there is a never ending string of sandbox escapes due to
> token misuse, unexpected things at various integrity levels, etc.
> Seccomp doesn't have tokens or integrity levels, and these bugs don't
> happen.
> 
> Second, seccomp works, almost unchanged, in a completely unprivileged
> context.  The last time making eBPF work sensibly in a less- or

Yeah, which is pretty important.
Christian Brauner May 20, 2021, 9:37 a.m. UTC | #11
On Thu, May 20, 2021 at 10:56:13AM +0200, Christian Brauner wrote:
> On Thu, May 20, 2021 at 03:16:10AM -0500, Tianyin Xu wrote:
> > On Mon, May 17, 2021 at 10:40 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > >
> > > On Sun, May 16, 2021 at 03:38:00AM -0500, Tianyin Xu wrote:
> > > > On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
> > > > >
> > > > > On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > > > > > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > > > >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > > > > >>>
> > > > > >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> > > > > >>>
> > > > > >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> > > > > >>>
> > > > > >>> This patchset enables seccomp filters to be written in eBPF.
> > > > > >>> Supporting eBPF filters has been proposed a few times in the past.
> > > > > >>> The main concerns were (1) use cases and (2) security. We have
> > > > > >>> identified many use cases that can benefit from advanced eBPF
> > > > > >>> filters, such as:
> > > > > >>
> > > > > >> I haven't reviewed this carefully, but I think we need to distinguish
> > > > > >> a few things:
> > > > > >>
> > > > > >> 1. Using the eBPF *language*.
> > > > > >>
> > > > > >> 2. Allowing the use of stateful / non-pure eBPF features.
> > > > > >>
> > > > > >> 3. Allowing the eBPF programs to read the target process' memory.
> > > > > >>
> > > > > >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> > > > > >> even less convinced by (3).
> > > > > >>
> > > > > >>>
> > > > > >>>   * exec-only-once filter / apply filter after exec
> > > > > >>
> > > > > >> This is (2).  I'm not sure it's a good idea.
> > > > > >
> > > > > > The basic idea is that for a container runtime it may wait to execute
> > > > > > a program in a container without that program being able to execve
> > > > > > another program, stopping any attack that involves loading another
> > > > > > binary. The container runtime can block any syscall but execve in the
> > > > > > exec-ed process by using only cBPF.
> > > > > >
> > > > > > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > > > > > @Andrea and @Giuseppe, could you clarify more in case I missed
> > > > > > something?
> > > > >
> > > > > We've discussed having a notifier-using filter be able to replace its
> > > > > filter.  This would allow this and other use cases without any
> > > > > additional eBPF or cBPF code.
> > > > >
> > > >
> > > > A notifier is not always a solution (even ignoring its perf overhead).
> > > >
> > > > One problem, pointed out by Andrea Arcangeli, is that notifiers need
> > > > userspace daemons. So, it can hardly be used by daemonless container
> > > > engines like Podman.
> > >
> > > I'm not sure I buy this argument. Podman already has a conmon instance
> > > for each container, this could be a child of that conmon process, or
> > > live inside conmon itself.
> > >
> > > Tycho
> > 
> > I checked with Andrea Arcangeli and Giuseppe Scrivano who are working on Podman.
> > 
> > You are right that Podman is not completely daemonless. However, “the
> > fact it's no entirely daemonless doesn't imply it's a good idea to
> > make it worse and to add complexity to the background conmon daemon or
> > to add more daemons.”
> > 
> > TL;DR. User notifiers are surely more flexible, but are also more
> > expensive and complex to implement, compared with ebpf filters. /*
> > I’ll reply to Sargun’s performance argument in a separate email */
> > 
> > I'm sure you know Podman well, but let me still move some jade from
> > Andrea and Giuseppe (all credits on podmon/crun are theirs) to
> > elaborate the point, for folks cced on the list who are not very
> > familiar with Podman.
> > 
> > Basically, the current order goes as follows:
> > 
> >          podman -> conmon -> crun -> container_binary
> >                                \
> >                                 - seccomp done at crun level, not conmon
> > 
> > At runtime, what's left is:
> > 
> >          conmon -> container_binary  /* podman disappears; crun disappears */
> > 
> > So, to go through and use seccomp notify to block `exec`, we can
> > either start the container_binary with a seccomp agent wrapper, or
> > bloat the common binary (as pointed out by Tycho).
> > 
> > If we go with the first approach, we will have:
> > 
> >          podman -> conmon -> crun -> seccomp_agent -> container_binary
> > 
> > So, at runtime we'd be left with one more daemon:
> > 
> >         conmon -> seccomp_agent -> container_binary
> 
> That seems like a strawman. I don't see why this has to be out of
> process or a separate daemon. Conmon uses a regular event loop. Adding
> support for processing notifier syscall notifications is
> straightforward. Moving it to a plugin as you mentioned below is a
> design decision not a necessity.
> 
> > 
> > Apparently, nobody likes one more daemon. So, the proposal from
> 
> I'm not sure such a blanket statements about an indeterminate group of
> people's alleged preferences constitutes a technical argument wny we
> need ebpf in seccomp.

I was missing a :) here.

> 
> > Giuseppe was/is to use user notifiers as plugins (.so) loaded by
> > conmon:
> > https://github.com/containers/conmon/pull/190
> > https://github.com/containers/crun/pull/438
> > 
> > Now, with the ebpf filter support, one can implement the same thing
> > using an embarrassingly simple ebpf filter and, thanks to Giuseppe,
> > this is well supported by crun.
> 
> So I think this is trying to jump the gun by saying "Look, the result
> might be simpler.". That may even be the case - though I'm not yet
> convinced - but Andy's point stands that this brings a slew of issues on
> the table that need clear answers. Bringing stateful ebpf features into
> seccomp is a pretty big step and especially around the
> privilege/security model it looks pretty handwavy right now.
> 
> Christian
>
Tianyin Xu May 20, 2021, 10:13 p.m. UTC | #12
On Thu, May 20, 2021 at 3:56 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Thu, May 20, 2021 at 03:16:10AM -0500, Tianyin Xu wrote:
> > On Mon, May 17, 2021 at 10:40 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > >
> > > On Sun, May 16, 2021 at 03:38:00AM -0500, Tianyin Xu wrote:
> > > > On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
> > > > >
> > > > > On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > > > > > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > > > >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > > > > >>>
> > > > > >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> > > > > >>>
> > > > > >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> > > > > >>>
> > > > > >>> This patchset enables seccomp filters to be written in eBPF.
> > > > > >>> Supporting eBPF filters has been proposed a few times in the past.
> > > > > >>> The main concerns were (1) use cases and (2) security. We have
> > > > > >>> identified many use cases that can benefit from advanced eBPF
> > > > > >>> filters, such as:
> > > > > >>
> > > > > >> I haven't reviewed this carefully, but I think we need to distinguish
> > > > > >> a few things:
> > > > > >>
> > > > > >> 1. Using the eBPF *language*.
> > > > > >>
> > > > > >> 2. Allowing the use of stateful / non-pure eBPF features.
> > > > > >>
> > > > > >> 3. Allowing the eBPF programs to read the target process' memory.
> > > > > >>
> > > > > >> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
> > > > > >> even less convinced by (3).
> > > > > >>
> > > > > >>>
> > > > > >>>   * exec-only-once filter / apply filter after exec
> > > > > >>
> > > > > >> This is (2).  I'm not sure it's a good idea.
> > > > > >
> > > > > > The basic idea is that for a container runtime it may wait to execute
> > > > > > a program in a container without that program being able to execve
> > > > > > another program, stopping any attack that involves loading another
> > > > > > binary. The container runtime can block any syscall but execve in the
> > > > > > exec-ed process by using only cBPF.
> > > > > >
> > > > > > The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
> > > > > > @Andrea and @Giuseppe, could you clarify more in case I missed
> > > > > > something?
> > > > >
> > > > > We've discussed having a notifier-using filter be able to replace its
> > > > > filter.  This would allow this and other use cases without any
> > > > > additional eBPF or cBPF code.
> > > > >
> > > >
> > > > A notifier is not always a solution (even ignoring its perf overhead).
> > > >
> > > > One problem, pointed out by Andrea Arcangeli, is that notifiers need
> > > > userspace daemons. So, it can hardly be used by daemonless container
> > > > engines like Podman.
> > >
> > > I'm not sure I buy this argument. Podman already has a conmon instance
> > > for each container, this could be a child of that conmon process, or
> > > live inside conmon itself.
> > >
> > > Tycho
> >
> > I checked with Andrea Arcangeli and Giuseppe Scrivano who are working on Podman.
> >
> > You are right that Podman is not completely daemonless. However, “the
> > fact it's no entirely daemonless doesn't imply it's a good idea to
> > make it worse and to add complexity to the background conmon daemon or
> > to add more daemons.”
> >
> > TL;DR. User notifiers are surely more flexible, but are also more
> > expensive and complex to implement, compared with ebpf filters. /*
> > I’ll reply to Sargun’s performance argument in a separate email */
> >
> > I'm sure you know Podman well, but let me still move some jade from
> > Andrea and Giuseppe (all credits on podmon/crun are theirs) to
> > elaborate the point, for folks cced on the list who are not very
> > familiar with Podman.
> >
> > Basically, the current order goes as follows:
> >
> >          podman -> conmon -> crun -> container_binary
> >                                \
> >                                 - seccomp done at crun level, not conmon
> >
> > At runtime, what's left is:
> >
> >          conmon -> container_binary  /* podman disappears; crun disappears */
> >
> > So, to go through and use seccomp notify to block `exec`, we can
> > either start the container_binary with a seccomp agent wrapper, or
> > bloat the common binary (as pointed out by Tycho).
> >
> > If we go with the first approach, we will have:
> >
> >          podman -> conmon -> crun -> seccomp_agent -> container_binary
> >
> > So, at runtime we'd be left with one more daemon:
> >
> >         conmon -> seccomp_agent -> container_binary
>
> That seems like a strawman. I don't see why this has to be out of
> process or a separate daemon. Conmon uses a regular event loop. Adding
> support for processing notifier syscall notifications is
> straightforward. Moving it to a plugin as you mentioned below is a
> design decision not a necessity.
>
> >
> > Apparently, nobody likes one more daemon. So, the proposal from
>
> I'm not sure such a blanket statements about an indeterminate group of
> people's alleged preferences constitutes a technical argument wny we
> need ebpf in seccomp.
>
> > Giuseppe was/is to use user notifiers as plugins (.so) loaded by
> > conmon:
> > https://urldefense.com/v3/__https://github.com/containers/conmon/pull/190__;!!DZ3fjg!qFZ7PXfFe7eI1Bye9J8zsGOxTQQlfL-pBh0D7Arn1YZKevtEpA9uxKqMTP9kA5RJ$
> > https://urldefense.com/v3/__https://github.com/containers/crun/pull/438__;!!DZ3fjg!qFZ7PXfFe7eI1Bye9J8zsGOxTQQlfL-pBh0D7Arn1YZKevtEpA9uxKqMTJrKzhUD$
> >
> > Now, with the ebpf filter support, one can implement the same thing
> > using an embarrassingly simple ebpf filter and, thanks to Giuseppe,
> > this is well supported by crun.
>
> So I think this is trying to jump the gun by saying "Look, the result
> might be simpler.". That may even be the case - though I'm not yet
> convinced - but Andy's point stands that this brings a slew of issues on
> the table that need clear answers. Bringing stateful ebpf features into
> seccomp is a pretty big step and especially around the
> privilege/security model it looks pretty handwavy right now.
>
> Christian

If an alleged gunshot was the impression I left, I apologize.
Seriously, I have great respect for user notifiers -- my intention was
never to disregard it, or to argue that ebpf filters are always
strictly better.

On the other hand, I do believe (and tried to show) ebpf filters have
their own technical advantages, and can be very useful and efficient
in many use cases. Let me know if you don’t buy this.

It’s kinda weird that we are arguing against ebpf filters with user
notifiers (it’s analogous to "we don’t need Seccomp coz we have
ptrace…")

More importantly, I do really want to provide clear answers to the
privilege/security model, but I don’t precisely know what exactly
you’re referring to as "privilege/security model". Are you referring
to the one-way transition model of Seccomp (which may no longer be
held in stateful filters), or something different? It will be great if
you can clarify so we can answer explicitly.

Thanks!
Sargun Dhillon May 24, 2021, 6:55 p.m. UTC | #13
On Thu, May 20, 2021 at 1:22 AM Tianyin Xu <tyxu@illinois.edu> wrote:
>
> On Mon, May 17, 2021 at 12:08 PM Sargun Dhillon <sargun@sargun.me> wrote:
> >
> > While I agree with you that this is the case right now, there's no reason it
> > has to be the case. There's a variety of mechanisms that can be employed
> > to significantly speed up the performance of the notifier. For example, right
> > now the notifier is behind one large per-filter lock. That could be removed
> > allowing for better concurrency. There are a large number of mechanisms
> > that scale O(n) with the outstanding notifications -- again, something
> > that could be improved.
>
> Thanks for the pointer! But, I don’t think this can fundamentally
> eliminate the performance gap between the notifiers and the ebpf
> filters. IMHO, the additional context switches of user notifiers make
> the difference.
>
I mean, I still think it can be closed. Or at least get better. I've
thought about
working on performance improvements, but they're lower on the list
than functionality changes.

> >
> > The other big improvement that could be made is being able to use something
> > like io_uring with the notifier interface, but it would require a
> > fairly significant
> > user API change -- and a move away from ioctl. I'm not sure if people are
> > excited about that idea at the moment.
> >
>
> Apologize that I don’t fully understand your proposal. My
> understanding about io_uring is that it allows you to amortize the
> cost of context switch but not eliminate it, unless you are willing to
> dedicate a core for it. I still believe that, even with io_uring, user
> notifiers are going to be much slower than eBPF filters.
The notifier gets significantly slower as a function of the notifications. If
you have a large number of notifications in flight, or if you're trying to
concurrently handle a large number of notifications, it gets slower. This
is where something like io_uring is super useful in terms of reducing
wakeups.

Also, in the original futex2 patches, it had a mechanism to better handle
(scheduling) of notifier like cases[1]. If the seccomp notifier did a similar
thing, we could see better performance.

>
> Btw, our patches are based on your patch set (thank you!). Are you
> using user notifiers (with your improved version?) these days? It will
> be nice to hear your opinions on ebpf filters.
>
I'm so glad that someone is picking up the work on this.

> > >
> > >
> > > > >> eBPF doesn't really have a privilege model yet.  There was a long and
> > > > >> disappointing thread about this awhile back.
> > > > >
> > > > > The idea is that “seccomp-eBPF does not make life easier for an
> > > > > adversary”. Any attack an adversary could potentially utilize
> > > > > seccomp-eBPF, they can do the same with other eBPF features, i.e. it
> > > > > would be an issue with eBPF in general rather than specifically
> > > > > seccomp’s use of eBPF.
> > > > >
> > > > > Here it is referring to the helpers goes to the base
> > > > > bpf_base_func_proto if the caller is unprivileged (!bpf_capable ||
> > > > > !perfmon_capable). In this case, if the adversary would utilize eBPF
> > > > > helpers to perform an attack, they could do it via another
> > > > > unprivileged prog type.
> > > > >
> > > > > That said, there are a few additional helpers this patchset is adding:
> > > > > * get_current_uid_gid
> > > > > * get_current_pid_tgid
> > > > >   These two provide public information (are namespaces a concern?). I
> > > > > have no idea what kind of exploit it could add unless the adversary
> > > > > somehow side-channels the task_struct? But in that case, how is the
> > > > > reading of task_struct different from how the rest of the kernel is
> > > > > reading task_struct?
> > > >
> > > > Yes, namespaces are a concern.  This idea got mostly shot down for kdbus
> > > > (what ever happened to that?), and it likely has the same problems for
> > > > seccomp.
> > > >
So, we actually have a case where we want to inspect an argument --
We want to look at the FD number that's passed to the sendmsg syscall, and then
see if that's an AF_INET socket, and if it is, then pass back to
notifier, otherwise
allow it to continue through. This is an area where I can see eBPF being
very useful.

> > > > >>
> > > > >> What is this for?
> > > > >
> > > > > Memory reading opens up lots of use cases. For example, logging what
> > > > > files are being opened without imposing too much performance penalty
> > > > > from strace. Or as an accelerator for user notify emulation, where
> > > > > syscalls can be rejected on a fast path if we know the memory contents
> > > > > does not satisfy certain conditions that user notify will check.
> > > > >
> > > >
> > > > This has all kinds of race conditions.
> > > >
> > > >
> > > > I hate to be a party pooper, but this patchset is going to very high bar
> > > > to acceptance.  Right now, seccomp has a couple of excellent properties:
> > > >
> > > > First, while it has limited expressiveness, it is simple enough that the
> > > > implementation can be easily understood and the scope for
> > > > vulnerabilities that fall through the cracks of the seccomp sandbox
> > > > model is low.  Compare this to Windows' low-integrity/high-integrity
> > > > sandbox system: there is a never ending string of sandbox escapes due to
> > > > token misuse, unexpected things at various integrity levels, etc.
> > > > Seccomp doesn't have tokens or integrity levels, and these bugs don't
> > > > happen.
> > > >
> > > > Second, seccomp works, almost unchanged, in a completely unprivileged
> > > > context.  The last time making eBPF work sensibly in a less- or
> > > > -unprivileged context, the maintainers mostly rejected the idea of
> > > > developing/debugging a permission model for maps, cleaning up the bpf
> > > > object id system, etc.  You are going to have a very hard time
> > > > convincing the seccomp maintainers to let any of these mechanism
> > > > interact with seccomp until the underlying permission model is in place.
> > > >
> > > > --Andy
> > >
> > > Thanks for pointing out the tradeoff between expressiveness vs. simplicity.
> > >
> > > Note that we are _not_ proposing to replace cbpf, but propose to also
> > > support ebpf filters. There certainly are use cases where cbpf is
> > > sufficient, but there are also important use cases ebpf could make
> > > life much easier.
> > >
> > > Most importantly, we strongly believe that ebpf filters can be
> > > supported without reducing security.
> > >
> > > No worries about “party pooping” and we appreciate the feedback. We’d
> > > love to hear concerns and collect feedback so we can address them to
> > > hit that very high bar.
> > >
> > >
> > > ~t
> > >
> > > --
> > > Tianyin Xu
> > > University of Illinois at Urbana-Champaign
> > > https://urldefense.com/v3/__https://tianyin.github.io/__;!!DZ3fjg!o4__Ob32oapUDg9_f6hzksoFiX9517CJ5-w8qtG9i-WKFs_xWbGQfUHpLjHjCddw$
>

[1]: https://lore.kernel.org/lkml/20210215152404.250281-1-andrealmeid@collabora.com/T/
Kees Cook June 1, 2021, 7:55 p.m. UTC | #14
On Thu, May 20, 2021 at 10:56:13AM +0200, Christian Brauner wrote:
> On Thu, May 20, 2021 at 03:16:10AM -0500, Tianyin Xu wrote:
> > On Mon, May 17, 2021 at 10:40 AM Tycho Andersen <tycho@tycho.pizza> wrote:
> > >
> > > On Sun, May 16, 2021 at 03:38:00AM -0500, Tianyin Xu wrote:
> > > > On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
> > > > >
> > > > > On 5/10/21 10:21 PM, YiFei Zhu wrote:
> > > > > > On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > > > >> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
> > > > > >>>
> > > > > >>> From: YiFei Zhu <yifeifz2@illinois.edu>
> > > > > >>>
> > > > > >>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
> > > > > >>>
> > > > > >>> This patchset enables seccomp filters to be written in eBPF.

Before I dive in, I do want to say that this is very interesting work.
Thanks for working on it, even if we're all so grumpy about accepting
it. :)

> > > > > >>> Supporting eBPF filters has been proposed a few times in the past.
> > > > > >>> The main concerns were (1) use cases and (2) security. We have
> > > > > >>> identified many use cases that can benefit from advanced eBPF
> > > > > >>> filters, such as:
> > > > > >>
> > > > > >> I haven't reviewed this carefully, but I think we need to distinguish
> > > > > >> a few things:
> > > > > >>
> > > > > >> 1. Using the eBPF *language*.

Likely everyone is aware, but I'll point out for anyone new reading this
thread: seccomp uses eBPF under the hood: all the cBPF is transformed to
eBPF at filter attach time. But yes, I get the point: using the _entire_
eBPF language. Though I'd remind folks that seccomp doesn't even use
the entire cBPF language...

> [...] but Andy's point stands that this brings a slew of issues on
> the table that need clear answers. Bringing stateful ebpf features into
> seccomp is a pretty big step and especially around the
> privilege/security model it looks pretty handwavy right now.

This is the blocker as far as I'm concerned: there is no story for
unprivileged eBPF. And even IF there was a story there, I find the rate
of security-related flaws in eBPF to be way too high for a sandboxing
primitive to depend on. There have been around a dozen a year for the
last 4 years:

$ git log --oneline --no-merges --pretty=format:'%as %h %s' \
   -i -E \ --all-match --grep '^Fixes:' --grep \
   '(over|under)flow|\bleak|escalat|expos(e[ds]?|ure)\b|use[- ]?after[- ]?free' \
   -- kernel/bpf/ | cut -d- -f1 | sort | uniq -c
      4 2015
      4 2016
     13 2017
     16 2018
     18 2019
     12 2020
      6 2021

I just can't bring myself to accept that level of risk for seccomp. (And
yes, this might be mitigated by blocking the bpf() syscall within a
filter, but then eBPF seccomp would become kind of useless inside a
container launcher, etc etc.)

-Kees
Jinghao Jia June 9, 2021, 6:27 a.m. UTC | #15
On 5/20/21 3:56 AM, Christian Brauner wrote:
> On Thu, May 20, 2021 at 03:16:10AM -0500, Tianyin Xu wrote:
>> On Mon, May 17, 2021 at 10:40 AM Tycho Andersen <tycho@tycho.pizza> wrote:
>>> On Sun, May 16, 2021 at 03:38:00AM -0500, Tianyin Xu wrote:
>>>> On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
>>>>> On 5/10/21 10:21 PM, YiFei Zhu wrote:
>>>>>> On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>>>>> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>>>>>>>> From: YiFei Zhu <yifeifz2@illinois.edu>
>>>>>>>>
>>>>>>>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
>>>>>>>>
>>>>>>>> This patchset enables seccomp filters to be written in eBPF.
>>>>>>>> Supporting eBPF filters has been proposed a few times in the past.
>>>>>>>> The main concerns were (1) use cases and (2) security. We have
>>>>>>>> identified many use cases that can benefit from advanced eBPF
>>>>>>>> filters, such as:
>>>>>>> I haven't reviewed this carefully, but I think we need to distinguish
>>>>>>> a few things:
>>>>>>>
>>>>>>> 1. Using the eBPF *language*.
>>>>>>>
>>>>>>> 2. Allowing the use of stateful / non-pure eBPF features.
>>>>>>>
>>>>>>> 3. Allowing the eBPF programs to read the target process' memory.
>>>>>>>
>>>>>>> I'm generally in favor of (1).  I'm not at all sure about (2), and I'm
>>>>>>> even less convinced by (3).
>>>>>>>
>>>>>>>>    * exec-only-once filter / apply filter after exec
>>>>>>> This is (2).  I'm not sure it's a good idea.
>>>>>> The basic idea is that for a container runtime it may wait to execute
>>>>>> a program in a container without that program being able to execve
>>>>>> another program, stopping any attack that involves loading another
>>>>>> binary. The container runtime can block any syscall but execve in the
>>>>>> exec-ed process by using only cBPF.
>>>>>>
>>>>>> The use case is suggested by Andrea Arcangeli and Giuseppe Scrivano.
>>>>>> @Andrea and @Giuseppe, could you clarify more in case I missed
>>>>>> something?
>>>>> We've discussed having a notifier-using filter be able to replace its
>>>>> filter.  This would allow this and other use cases without any
>>>>> additional eBPF or cBPF code.
>>>>>
>>>> A notifier is not always a solution (even ignoring its perf overhead).
>>>>
>>>> One problem, pointed out by Andrea Arcangeli, is that notifiers need
>>>> userspace daemons. So, it can hardly be used by daemonless container
>>>> engines like Podman.
>>> I'm not sure I buy this argument. Podman already has a conmon instance
>>> for each container, this could be a child of that conmon process, or
>>> live inside conmon itself.
>>>
>>> Tycho
>> I checked with Andrea Arcangeli and Giuseppe Scrivano who are working on Podman.
>>
>> You are right that Podman is not completely daemonless. However, “the
>> fact it's no entirely daemonless doesn't imply it's a good idea to
>> make it worse and to add complexity to the background conmon daemon or
>> to add more daemons.”
>>
>> TL;DR. User notifiers are surely more flexible, but are also more
>> expensive and complex to implement, compared with ebpf filters. /*
>> I’ll reply to Sargun’s performance argument in a separate email */
>>
>> I'm sure you know Podman well, but let me still move some jade from
>> Andrea and Giuseppe (all credits on podmon/crun are theirs) to
>> elaborate the point, for folks cced on the list who are not very
>> familiar with Podman.
>>
>> Basically, the current order goes as follows:
>>
>>           podman -> conmon -> crun -> container_binary
>>                                 \
>>                                  - seccomp done at crun level, not conmon
>>
>> At runtime, what's left is:
>>
>>           conmon -> container_binary  /* podman disappears; crun disappears */
>>
>> So, to go through and use seccomp notify to block `exec`, we can
>> either start the container_binary with a seccomp agent wrapper, or
>> bloat the common binary (as pointed out by Tycho).
>>
>> If we go with the first approach, we will have:
>>
>>           podman -> conmon -> crun -> seccomp_agent -> container_binary
>>
>> So, at runtime we'd be left with one more daemon:
>>
>>          conmon -> seccomp_agent -> container_binary
> That seems like a strawman. I don't see why this has to be out of
> process or a separate daemon. Conmon uses a regular event loop. Adding
> support for processing notifier syscall notifications is
> straightforward. Moving it to a plugin as you mentioned below is a
> design decision not a necessity.
>
>> Apparently, nobody likes one more daemon. So, the proposal from
> I'm not sure such a blanket statements about an indeterminate group of
> people's alleged preferences constitutes a technical argument wny we
> need ebpf in seccomp.
>
>> Giuseppe was/is to use user notifiers as plugins (.so) loaded by
>> conmon:
>> https://urldefense.com/v3/__https://github.com/containers/conmon/pull/190__;!!DZ3fjg!qjoih4kOsHD09Yg41YKmYQrW_YhB3AzV0sgWZsRK621KIf7eTKiMMhAiew-ySWA_vbUt$
>> https://urldefense.com/v3/__https://github.com/containers/crun/pull/438__;!!DZ3fjg!qjoih4kOsHD09Yg41YKmYQrW_YhB3AzV0sgWZsRK621KIf7eTKiMMhAiew-ySfWBbnxD$
>>
>> Now, with the ebpf filter support, one can implement the same thing
>> using an embarrassingly simple ebpf filter and, thanks to Giuseppe,
>> this is well supported by crun.
> So I think this is trying to jump the gun by saying "Look, the result
> might be simpler.". That may even be the case - though I'm not yet
> convinced - but Andy's point stands that this brings a slew of issues on
> the table that need clear answers. Bringing stateful ebpf features into
> seccomp is a pretty big step and especially around the
> privilege/security model it looks pretty handwavy right now.
For the privilege/security model, I assume that you are referring to a 
way to safely do unprivileged ebpf and to make sure the ebpf features 
can be used by seccomp securely.

In fact, the privilege model is carefully implemented in the patch set . 
As mentioned in the cover letter, we followed the security model of user 
notifier and ptrace in a way that our implementation is as restrictive 
as them. Let me elaborate:

1. We require no less privilege than Seccomp or eBPF individually, (e.g. 
filter loading and uses of BPF helpers)

2. The new seccomp_extended LSM hook restricts the use of advanced bpf 
features (maps and helpers). Only when the hook permits the access can 
filters use standard helpers. The LSM hook is implemented in Yama and 
uses ptrace_scope to determine whether to allow access. This is based on 
the idea of reduction to ptrace, as the eBPF filters can instrument the 
process similar to ptrace.

3. The tracing helpers require additional capabilities (CAP_BPF and 
CAP_PERFMON).

4. For user-memory reading, we require CAP_PTRACE to read memory of 
non-dumpable processes. If the capability is not fulfilled, the 
bpf_user_probe{,str} helper would return -EPERM. This is, again, 
reduction to ptrace.

We acknowledge the concerns about user namespace pointed out by Alexei 
Starovoitov. We are more than happy to roll out the solution in the V2 
patch.

Best,
Jinghao

> Christian
Jinghao Jia June 9, 2021, 6:32 a.m. UTC | #16
On 6/1/21 2:55 PM, Kees Cook wrote:
> On Thu, May 20, 2021 at 10:56:13AM +0200, Christian Brauner wrote:
>> On Thu, May 20, 2021 at 03:16:10AM -0500, Tianyin Xu wrote:
>>> On Mon, May 17, 2021 at 10:40 AM Tycho Andersen <tycho@tycho.pizza> wrote:
>>>> On Sun, May 16, 2021 at 03:38:00AM -0500, Tianyin Xu wrote:
>>>>> On Sat, May 15, 2021 at 10:49 AM Andy Lutomirski <luto@kernel.org> wrote:
>>>>>> On 5/10/21 10:21 PM, YiFei Zhu wrote:
>>>>>>> On Mon, May 10, 2021 at 12:47 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>>>>>> On Mon, May 10, 2021 at 10:22 AM YiFei Zhu <zhuyifei1999@gmail.com> wrote:
>>>>>>>>> From: YiFei Zhu <yifeifz2@illinois.edu>
>>>>>>>>>
>>>>>>>>> Based on: https://urldefense.com/v3/__https://lists.linux-foundation.org/pipermail/containers/2018-February/038571.html__;!!DZ3fjg!thbAoRgmCeWjlv0qPDndNZW1j6Y2Kl_huVyUffr4wVbISf-aUiULaWHwkKJrNJyo$
>>>>>>>>>
>>>>>>>>> This patchset enables seccomp filters to be written in eBPF.
> Before I dive in, I do want to say that this is very interesting work.
> Thanks for working on it, even if we're all so grumpy about accepting
> it. :)
>
>>>>>>>>> Supporting eBPF filters has been proposed a few times in the past.
>>>>>>>>> The main concerns were (1) use cases and (2) security. We have
>>>>>>>>> identified many use cases that can benefit from advanced eBPF
>>>>>>>>> filters, such as:
>>>>>>>> I haven't reviewed this carefully, but I think we need to distinguish
>>>>>>>> a few things:
>>>>>>>>
>>>>>>>> 1. Using the eBPF *language*.
> Likely everyone is aware, but I'll point out for anyone new reading this
> thread: seccomp uses eBPF under the hood: all the cBPF is transformed to
> eBPF at filter attach time. But yes, I get the point: using the _entire_
> eBPF language. Though I'd remind folks that seccomp doesn't even use
> the entire cBPF language...
>
>> [...] but Andy's point stands that this brings a slew of issues on
>> the table that need clear answers. Bringing stateful ebpf features into
>> seccomp is a pretty big step and especially around the
>> privilege/security model it looks pretty handwavy right now.
> This is the blocker as far as I'm concerned: there is no story for
> unprivileged eBPF. And even IF there was a story there, I find the rate
> of security-related flaws in eBPF to be way too high for a sandboxing
> primitive to depend on. There have been around a dozen a year for the
> last 4 years:
>
> $ git log --oneline --no-merges --pretty=format:'%as %h %s' \
>     -i -E \ --all-match --grep '^Fixes:' --grep \
>     '(over|under)flow|\bleak|escalat|expos(e[ds]?|ure)\b|use[- ]?after[- ]?free' \
>     -- kernel/bpf/ | cut -d- -f1 | sort | uniq -c
>        4 2015
>        4 2016
>       13 2017
>       16 2018
>       18 2019
>       12 2020
>        6 2021
>
> I just can't bring myself to accept that level of risk for seccomp.

I just want to clarify that the patch is not supposed to add more risks 
to seccomp.


The vulnerabilities of ebpf are inherently there as long as ebpf is 
supported, no matter whether Seccomp supports ebpf filters or not. If 
ebpf is of concern, one can turn off ebpf completely and Seccomp ebpf 
won’t be available. Otherwise, the vulnerabilities are in your socket 
filters anyway.

> (And
> yes, this might be mitigated by blocking the bpf() syscall within a
> filter, but then eBPF seccomp would become kind of useless inside a
> container launcher, etc etc.)

This is an interesting point. I think the main concern is still about 
the additional risks (which I responded above).


I responded to Christian Brauner earlier about the security model. 
Basically, the implementation is as restrictive as user notifier and 
ptrace. For example, if CAP_BPF is not there, the container won’t be 
able to load ebpf filters with tracing helpers.


In fact, one can load the ebpf filter first and then block the bpf 
syscall (if it makes things more secure).


Best,

Jinghao

>
> -Kees
>