mbox series

[v6,bpf-next,0/3] Introduce CAP_BPF

Message ID 20200513031930.86895-1-alexei.starovoitov@gmail.com (mailing list archive)
Headers show
Series Introduce CAP_BPF | expand

Message

Alexei Starovoitov May 13, 2020, 3:19 a.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

v5->v6:
- split allow_ptr_leaks into four flags.
- retain bpf_jit_limit under cap_sys_admin.
- fixed few other issues spotted by Daniel.

v4->v5:

Split BPF operations that are allowed under CAP_SYS_ADMIN into combination of
CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN and keep some of them under CAP_SYS_ADMIN.

The user process has to have
- CAP_BPF to create maps and do other sys_bpf() commands
- CAP_BPF and CAP_PERFMON to load tracing programs.
- CAP_BPF and CAP_NET_ADMIN to load networking programs.
(or CAP_SYS_ADMIN for backward compatibility).

CAP_BPF solves three main goals:
1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF.
   More on this below. This is the major difference vs v4 set back from Sep 2019.
2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN
   prevents pointer leaks and arbitrary kernel memory access.
3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs
   and making BPF infra more secure. Currently fuzzers run in unpriv.
   They will be able to run with CAP_BPF.

The patchset is long overdue follow-up from the last plumbers conference.
Comparing to what was discussed at LPC the CAP* checks at attach time are gone.
For tracing progs the CAP_SYS_ADMIN check was done at load time only. There was
no check at attach time. For networking and cgroup progs CAP_SYS_ADMIN was
required at load time and CAP_NET_ADMIN at attach time, but there are several
ways to bypass CAP_NET_ADMIN:
- if networking prog is using tail_call writing FD into prog_array will
  effectively attach it, but bpf_map_update_elem is an unprivileged operation.
- freplace prog with CAP_SYS_ADMIN can replace networking prog

Consolidating all CAP checks at load time makes security model similar to
open() syscall. Once the user got an FD it can do everything with it.
read/write/poll don't check permissions. The same way when bpf_prog_load
command returns an FD the user can do everything (including attaching,
detaching, and bpf_test_run).

The important design decision is to allow ID->FD transition for
CAP_SYS_ADMIN only. What it means that user processes can run
with CAP_BPF and CAP_NET_ADMIN and they will not be able to affect each
other unless they pass FDs via scm_rights or via pinning in bpffs.
ID->FD is a mechanism for human override and introspection.
An admin can do 'sudo bpftool prog ...'. It's possible to enforce via LSM that
only bpftool binary does bpf syscall with CAP_SYS_ADMIN and the rest of user
space processes do bpf syscall with CAP_BPF isolating bpf objects (progs, maps,
links) that are owned by such processes from each other.

Another significant change from LPC is that the verifier checks are split into
four flags. The allow_ptr_leaks flag allows pointer manipulations. The
bpf_capable flag enables all modern verifier features like bpf-to-bpf calls,
BTF, bounded loops, dead code elimination, etc. All the goodness. The
bypass_spec_v1 flag enables indirect stack access from bpf programs and
disables speculative analysis and bpf array mitigations. The bypass_spec_v4
flag disables store sanitation. That allows networking progs with CAP_BPF +
CAP_NET_ADMIN enjoy modern verifier features while being more secure.

Some networking progs may need CAP_BPF + CAP_NET_ADMIN + CAP_PERFMON,
since subtracting pointers (like skb->data_end - skb->data) is a pointer leak,
but the verifier may get smarter in the future.

Please see patches for more details.

Alexei Starovoitov (3):
  bpf, capability: Introduce CAP_BPF
  bpf: implement CAP_BPF
  selftests/bpf: use CAP_BPF and CAP_PERFMON in tests

 drivers/media/rc/bpf-lirc.c                   |  2 +-
 include/linux/bpf.h                           | 18 +++-
 include/linux/bpf_verifier.h                  |  3 +
 include/linux/capability.h                    |  5 ++
 include/uapi/linux/capability.h               | 34 +++++++-
 kernel/bpf/arraymap.c                         | 10 +--
 kernel/bpf/bpf_struct_ops.c                   |  2 +-
 kernel/bpf/core.c                             |  2 +-
 kernel/bpf/cpumap.c                           |  2 +-
 kernel/bpf/hashtab.c                          |  4 +-
 kernel/bpf/helpers.c                          |  4 +-
 kernel/bpf/lpm_trie.c                         |  2 +-
 kernel/bpf/map_in_map.c                       |  2 +-
 kernel/bpf/queue_stack_maps.c                 |  2 +-
 kernel/bpf/reuseport_array.c                  |  2 +-
 kernel/bpf/stackmap.c                         |  2 +-
 kernel/bpf/syscall.c                          | 87 ++++++++++++++-----
 kernel/bpf/verifier.c                         | 37 ++++----
 kernel/trace/bpf_trace.c                      |  3 +
 net/core/bpf_sk_storage.c                     |  4 +-
 net/core/filter.c                             |  4 +-
 security/selinux/include/classmap.h           |  4 +-
 tools/testing/selftests/bpf/test_verifier.c   | 44 ++++++++--
 tools/testing/selftests/bpf/verifier/calls.c  | 16 ++--
 .../selftests/bpf/verifier/dead_code.c        | 10 +--
 25 files changed, 221 insertions(+), 84 deletions(-)

Comments

Marek Majkowski May 13, 2020, 10:50 a.m. UTC | #1
On Wed, May 13, 2020 at 4:19 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> CAP_BPF solves three main goals:
> 1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF.
>    More on this below. This is the major difference vs v4 set back from Sep 2019.
> 2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN
>    prevents pointer leaks and arbitrary kernel memory access.
> 3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs
>    and making BPF infra more secure. Currently fuzzers run in unpriv.
>    They will be able to run with CAP_BPF.
>

Alexei, looking at this from a user point of view, this looks fine.

I'm slightly worried about REUSEPORT_EBPF. Currently without your
patch, as far as I understand it:

- You can load SOCKET_FILTER and SO_ATTACH_REUSEPORT_EBPF without any
permissions

- For loading BPF_PROG_TYPE_SK_REUSEPORT program and for SOCKARRAY map
creation CAP_SYS_ADMIN is needed. But again, no permissions check for
SO_ATTACH_REUSEPORT_EBPF later.

If I read the patchset correctly, the former SOCKET_FILTER case
remains as it is and is not affected in any way by presence or absence
of CAP_BPF.

The latter case is different. Presence of CAP_BPF is sufficient for
map creation, but not sufficient for loading SK_REUSEPORT program. It
still requires CAP_SYS_ADMIN. I think it's a good opportunity to relax
this CAP_SYS_ADMIN requirement. I think the presence of CAP_BPF should
be sufficient for loading BPF_PROG_TYPE_SK_REUSEPORT.

Our specific use case is simple - we want an application program -
like nginx - to control REUSEPORT programs. We will grant it CAP_BPF,
but we don't want to grant it CAP_SYS_ADMIN.

Marek
Alexei Starovoitov May 13, 2020, 5:53 p.m. UTC | #2
On Wed, May 13, 2020 at 11:50:42AM +0100, Marek Majkowski wrote:
> On Wed, May 13, 2020 at 4:19 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > CAP_BPF solves three main goals:
> > 1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF.
> >    More on this below. This is the major difference vs v4 set back from Sep 2019.
> > 2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN
> >    prevents pointer leaks and arbitrary kernel memory access.
> > 3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs
> >    and making BPF infra more secure. Currently fuzzers run in unpriv.
> >    They will be able to run with CAP_BPF.
> >
> 
> Alexei, looking at this from a user point of view, this looks fine.
> 
> I'm slightly worried about REUSEPORT_EBPF. Currently without your
> patch, as far as I understand it:
> 
> - You can load SOCKET_FILTER and SO_ATTACH_REUSEPORT_EBPF without any
> permissions

correct.

> - For loading BPF_PROG_TYPE_SK_REUSEPORT program and for SOCKARRAY map
> creation CAP_SYS_ADMIN is needed. But again, no permissions check for
> SO_ATTACH_REUSEPORT_EBPF later.

correct. With clarification that attaching process needs to own
FD of prog and FD of socket.

> If I read the patchset correctly, the former SOCKET_FILTER case
> remains as it is and is not affected in any way by presence or absence
> of CAP_BPF.

correct. As commit log says:
"Existing unprivileged BPF operations are not affected."

> The latter case is different. Presence of CAP_BPF is sufficient for
> map creation, but not sufficient for loading SK_REUSEPORT program. It
> still requires CAP_SYS_ADMIN. 

Not quite.
The patch will allow BPF_PROG_TYPE_SK_REUSEPORT progs to be loaded
with CAP_BPF + CAP_NET_ADMIN.
Since this type of progs is clearly networking type I figured it's
better to be consistent with the rest of networking types.
Two unpriv types SOCKET_FILTER and CGROUP_SKB is the only exception.

> I think it's a good opportunity to relax
> this CAP_SYS_ADMIN requirement. I think the presence of CAP_BPF should
> be sufficient for loading BPF_PROG_TYPE_SK_REUSEPORT.
> 
> Our specific use case is simple - we want an application program -
> like nginx - to control REUSEPORT programs. We will grant it CAP_BPF,
> but we don't want to grant it CAP_SYS_ADMIN.

You'll be able to grant nginx CAP_BPF + CAP_NET_ADMIN to load SK_REUSEPORT
and unpriv child process will be able to attach just like before if
it has right FDs.
I suspect your load balancer needs CAP_NET_ADMIN already anyway due to
use of XDP and TC progs.
So granting CAP_BPF + CAP_NET_ADMIN should cover all bpf prog needs.
Does it address your concern?
Marek Majkowski May 13, 2020, 6:30 p.m. UTC | #3
On Wed, May 13, 2020 at 6:53 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Wed, May 13, 2020 at 11:50:42AM +0100, Marek Majkowski wrote:
> > On Wed, May 13, 2020 at 4:19 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > CAP_BPF solves three main goals:
> > > 1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF.
> > >    More on this below. This is the major difference vs v4 set back from Sep 2019.
> > > 2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN
> > >    prevents pointer leaks and arbitrary kernel memory access.
> > > 3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs
> > >    and making BPF infra more secure. Currently fuzzers run in unpriv.
> > >    They will be able to run with CAP_BPF.
> > >
> >
> > Alexei, looking at this from a user point of view, this looks fine.
> >
> > I'm slightly worried about REUSEPORT_EBPF. Currently without your
> > patch, as far as I understand it:
> >
> > - You can load SOCKET_FILTER and SO_ATTACH_REUSEPORT_EBPF without any
> > permissions
>
> correct.
>
> > - For loading BPF_PROG_TYPE_SK_REUSEPORT program and for SOCKARRAY map
> > creation CAP_SYS_ADMIN is needed. But again, no permissions check for
> > SO_ATTACH_REUSEPORT_EBPF later.
>
> correct. With clarification that attaching process needs to own
> FD of prog and FD of socket.
>
> > If I read the patchset correctly, the former SOCKET_FILTER case
> > remains as it is and is not affected in any way by presence or absence
> > of CAP_BPF.
>
> correct. As commit log says:
> "Existing unprivileged BPF operations are not affected."
>
> > The latter case is different. Presence of CAP_BPF is sufficient for
> > map creation, but not sufficient for loading SK_REUSEPORT program. It
> > still requires CAP_SYS_ADMIN.
>
> Not quite.
> The patch will allow BPF_PROG_TYPE_SK_REUSEPORT progs to be loaded
> with CAP_BPF + CAP_NET_ADMIN.
> Since this type of progs is clearly networking type I figured it's
> better to be consistent with the rest of networking types.
> Two unpriv types SOCKET_FILTER and CGROUP_SKB is the only exception.

Ok, this is the controversy. It made sense to restrict SK_REUSEPORT
programs in the past, because programs needed CAP_NET_ADMIN to create
SOCKARRAY anyway. Now we change this and CAP_BPF is sufficient for
maps - I don't see why CAP_BPF is not sufficient for SK_REUSEPORT
programs. From a user point of view I don't get why this additional
CAP_NET_ADMIN is needed.

> > I think it's a good opportunity to relax
> > this CAP_SYS_ADMIN requirement. I think the presence of CAP_BPF should
> > be sufficient for loading BPF_PROG_TYPE_SK_REUSEPORT.
> >
> > Our specific use case is simple - we want an application program -
> > like nginx - to control REUSEPORT programs. We will grant it CAP_BPF,
> > but we don't want to grant it CAP_SYS_ADMIN.
>
> You'll be able to grant nginx CAP_BPF + CAP_NET_ADMIN to load SK_REUSEPORT
> and unpriv child process will be able to attach just like before if
> it has right FDs.
> I suspect your load balancer needs CAP_NET_ADMIN already anyway due to
> use of XDP and TC progs.
> So granting CAP_BPF + CAP_NET_ADMIN should cover all bpf prog needs.
> Does it address your concern?

Load balancer (XDP+TC) is another layer and permissions there are not
a problem. The specific issue is nginx (port 443) and QUIC. QUIC is
UDP and due to the nginx design we must use REUSEPORT groups to
balance the load across workers. This is fine and could be done with a
simple SOCK_FILTER - we don't need to grant nginx any permissions,
apart from CAP_NET_BIND_SERVICE.

We would like to make the REUSEPORT program more complex to take
advantage of REUSEPORT_EBPF for stickyness (restarting server without
interfering with existing flows), we are happy to grant nginx CAP_BPF,
but we are not happy to grant it CAP_NET_ADMIN. Requiring this CAP for
REUSEPORT severely restricts the API usability for us.

In my head REUSEPORT_EBPF is much closer to SOCKET_FILTER. I
understand why it needed capabilities before (map creation) and I
argue these reasons go away in CAP_BPF world. I assume that any
service (with CAP_BPF) should be able to use reuseport to distribute
packets within its own sockets.  Let me know if I'm missing something.

Marek
Alexei Starovoitov May 13, 2020, 6:54 p.m. UTC | #4
On Wed, May 13, 2020 at 07:30:05PM +0100, Marek Majkowski wrote:
> On Wed, May 13, 2020 at 6:53 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Wed, May 13, 2020 at 11:50:42AM +0100, Marek Majkowski wrote:
> > > On Wed, May 13, 2020 at 4:19 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > CAP_BPF solves three main goals:
> > > > 1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF.
> > > >    More on this below. This is the major difference vs v4 set back from Sep 2019.
> > > > 2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN
> > > >    prevents pointer leaks and arbitrary kernel memory access.
> > > > 3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs
> > > >    and making BPF infra more secure. Currently fuzzers run in unpriv.
> > > >    They will be able to run with CAP_BPF.
> > > >
> > >
> > > Alexei, looking at this from a user point of view, this looks fine.
> > >
> > > I'm slightly worried about REUSEPORT_EBPF. Currently without your
> > > patch, as far as I understand it:
> > >
> > > - You can load SOCKET_FILTER and SO_ATTACH_REUSEPORT_EBPF without any
> > > permissions
> >
> > correct.
> >
> > > - For loading BPF_PROG_TYPE_SK_REUSEPORT program and for SOCKARRAY map
> > > creation CAP_SYS_ADMIN is needed. But again, no permissions check for
> > > SO_ATTACH_REUSEPORT_EBPF later.
> >
> > correct. With clarification that attaching process needs to own
> > FD of prog and FD of socket.
> >
> > > If I read the patchset correctly, the former SOCKET_FILTER case
> > > remains as it is and is not affected in any way by presence or absence
> > > of CAP_BPF.
> >
> > correct. As commit log says:
> > "Existing unprivileged BPF operations are not affected."
> >
> > > The latter case is different. Presence of CAP_BPF is sufficient for
> > > map creation, but not sufficient for loading SK_REUSEPORT program. It
> > > still requires CAP_SYS_ADMIN.
> >
> > Not quite.
> > The patch will allow BPF_PROG_TYPE_SK_REUSEPORT progs to be loaded
> > with CAP_BPF + CAP_NET_ADMIN.
> > Since this type of progs is clearly networking type I figured it's
> > better to be consistent with the rest of networking types.
> > Two unpriv types SOCKET_FILTER and CGROUP_SKB is the only exception.
> 
> Ok, this is the controversy. It made sense to restrict SK_REUSEPORT
> programs in the past, because programs needed CAP_NET_ADMIN to create
> SOCKARRAY anyway. 

Not quite. Currently sockarray needs CAP_SYS_ADMIN to create
which makes little sense from security pov.
CAP_BPF relaxes it CAP_BPF or CAP_SYS_ADMIN.

> Now we change this and CAP_BPF is sufficient for
> maps - I don't see why CAP_BPF is not sufficient for SK_REUSEPORT
> programs. From a user point of view I don't get why this additional
> CAP_NET_ADMIN is needed.

That actually bring another point. I'm not changing sock_map,
sock_hash, dev_map requirements yet. All three still require CAP_NET_ADMIN.
We can relax them to CAP_BPF _or_ CAP_NET_ADMIN in the future,
but I'd like to do that in the follow up.

> 
> > > I think it's a good opportunity to relax
> > > this CAP_SYS_ADMIN requirement. I think the presence of CAP_BPF should
> > > be sufficient for loading BPF_PROG_TYPE_SK_REUSEPORT.
> > >
> > > Our specific use case is simple - we want an application program -
> > > like nginx - to control REUSEPORT programs. We will grant it CAP_BPF,
> > > but we don't want to grant it CAP_SYS_ADMIN.
> >
> > You'll be able to grant nginx CAP_BPF + CAP_NET_ADMIN to load SK_REUSEPORT
> > and unpriv child process will be able to attach just like before if
> > it has right FDs.
> > I suspect your load balancer needs CAP_NET_ADMIN already anyway due to
> > use of XDP and TC progs.
> > So granting CAP_BPF + CAP_NET_ADMIN should cover all bpf prog needs.
> > Does it address your concern?
> 
> Load balancer (XDP+TC) is another layer and permissions there are not
> a problem. The specific issue is nginx (port 443) and QUIC. QUIC is
> UDP and due to the nginx design we must use REUSEPORT groups to
> balance the load across workers. This is fine and could be done with a
> simple SOCK_FILTER - we don't need to grant nginx any permissions,
> apart from CAP_NET_BIND_SERVICE.
> 
> We would like to make the REUSEPORT program more complex to take
> advantage of REUSEPORT_EBPF for stickyness (restarting server without
> interfering with existing flows), we are happy to grant nginx CAP_BPF,
> but we are not happy to grant it CAP_NET_ADMIN. Requiring this CAP for
> REUSEPORT severely restricts the API usability for us.
> 
> In my head REUSEPORT_EBPF is much closer to SOCKET_FILTER. I
> understand why it needed capabilities before (map creation) and I
> argue these reasons go away in CAP_BPF world. I assume that any
> service (with CAP_BPF) should be able to use reuseport to distribute
> packets within its own sockets.  Let me know if I'm missing something.

Fair enough. We can include SK_REUSEPORT prog type as part of CAP_BPF alone.
But will it truly achieve what you want?
You still need CAP_NET_ADMIN for sock_hash which you're using.
Are you saying it's part of the different process that has that cap_net_admin
and nginx will be fine with cap_bpf + cap_net_bind_service ?
Marek Majkowski May 13, 2020, 9:14 p.m. UTC | #5
On Wed, May 13, 2020 at 7:54 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, May 13, 2020 at 07:30:05PM +0100, Marek Majkowski wrote:
> > On Wed, May 13, 2020 at 6:53 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > > On Wed, May 13, 2020 at 11:50:42AM +0100, Marek Majkowski wrote:
> > > > On Wed, May 13, 2020 at 4:19 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > CAP_BPF solves three main goals:
> > > > > 1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF.
> > > > >    More on this below. This is the major difference vs v4 set back from Sep 2019.
> > > > > 2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN
> > > > >    prevents pointer leaks and arbitrary kernel memory access.
> > > > > 3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs
> > > > >    and making BPF infra more secure. Currently fuzzers run in unpriv.
> > > > >    They will be able to run with CAP_BPF.
> > > > >
> > > >
> > > > Alexei, looking at this from a user point of view, this looks fine.
> > > >
> > > > I'm slightly worried about REUSEPORT_EBPF. Currently without your
> > > > patch, as far as I understand it:
> > > >
> > > > - You can load SOCKET_FILTER and SO_ATTACH_REUSEPORT_EBPF without any
> > > > permissions
> > >
> > > correct.
> > >
> > > > - For loading BPF_PROG_TYPE_SK_REUSEPORT program and for SOCKARRAY map
> > > > creation CAP_SYS_ADMIN is needed. But again, no permissions check for
> > > > SO_ATTACH_REUSEPORT_EBPF later.
> > >
> > > correct. With clarification that attaching process needs to own
> > > FD of prog and FD of socket.
> > >
> > > > If I read the patchset correctly, the former SOCKET_FILTER case
> > > > remains as it is and is not affected in any way by presence or absence
> > > > of CAP_BPF.
> > >
> > > correct. As commit log says:
> > > "Existing unprivileged BPF operations are not affected."
> > >
> > > > The latter case is different. Presence of CAP_BPF is sufficient for
> > > > map creation, but not sufficient for loading SK_REUSEPORT program. It
> > > > still requires CAP_SYS_ADMIN.
> > >
> > > Not quite.
> > > The patch will allow BPF_PROG_TYPE_SK_REUSEPORT progs to be loaded
> > > with CAP_BPF + CAP_NET_ADMIN.
> > > Since this type of progs is clearly networking type I figured it's
> > > better to be consistent with the rest of networking types.
> > > Two unpriv types SOCKET_FILTER and CGROUP_SKB is the only exception.
> >
> > Ok, this is the controversy. It made sense to restrict SK_REUSEPORT
> > programs in the past, because programs needed CAP_NET_ADMIN to create
> > SOCKARRAY anyway.
>
> Not quite. Currently sockarray needs CAP_SYS_ADMIN to create
> which makes little sense from security pov.
> CAP_BPF relaxes it CAP_BPF or CAP_SYS_ADMIN.
>
> > Now we change this and CAP_BPF is sufficient for
> > maps - I don't see why CAP_BPF is not sufficient for SK_REUSEPORT
> > programs. From a user point of view I don't get why this additional
> > CAP_NET_ADMIN is needed.
>
> That actually bring another point. I'm not changing sock_map,
> sock_hash, dev_map requirements yet. All three still require CAP_NET_ADMIN.
> We can relax them to CAP_BPF _or_ CAP_NET_ADMIN in the future,
> but I'd like to do that in the follow up.

Agreed, we can discuss relaxation of SOCKMAP in the future.

> > > > I think it's a good opportunity to relax
> > > > this CAP_SYS_ADMIN requirement. I think the presence of CAP_BPF should
> > > > be sufficient for loading BPF_PROG_TYPE_SK_REUSEPORT.
> > > >
> > > > Our specific use case is simple - we want an application program -
> > > > like nginx - to control REUSEPORT programs. We will grant it CAP_BPF,
> > > > but we don't want to grant it CAP_SYS_ADMIN.
> > >
> > > You'll be able to grant nginx CAP_BPF + CAP_NET_ADMIN to load SK_REUSEPORT
> > > and unpriv child process will be able to attach just like before if
> > > it has right FDs.
> > > I suspect your load balancer needs CAP_NET_ADMIN already anyway due to
> > > use of XDP and TC progs.
> > > So granting CAP_BPF + CAP_NET_ADMIN should cover all bpf prog needs.
> > > Does it address your concern?
> >
> > Load balancer (XDP+TC) is another layer and permissions there are not
> > a problem. The specific issue is nginx (port 443) and QUIC. QUIC is
> > UDP and due to the nginx design we must use REUSEPORT groups to
> > balance the load across workers. This is fine and could be done with a
> > simple SOCK_FILTER - we don't need to grant nginx any permissions,
> > apart from CAP_NET_BIND_SERVICE.
> >
> > We would like to make the REUSEPORT program more complex to take
> > advantage of REUSEPORT_EBPF for stickyness (restarting server without
> > interfering with existing flows), we are happy to grant nginx CAP_BPF,
> > but we are not happy to grant it CAP_NET_ADMIN. Requiring this CAP for
> > REUSEPORT severely restricts the API usability for us.
> >
> > In my head REUSEPORT_EBPF is much closer to SOCKET_FILTER. I
> > understand why it needed capabilities before (map creation) and I
> > argue these reasons go away in CAP_BPF world. I assume that any
> > service (with CAP_BPF) should be able to use reuseport to distribute
> > packets within its own sockets.  Let me know if I'm missing something.
>
> Fair enough. We can include SK_REUSEPORT prog type as part of CAP_BPF alone.
> But will it truly achieve what you want?

It will make the security model much more useful and sane for me and
other users of stuff that depends on SK_REUSEPORT (like nginx + UDP).
So yes, long-term it will help. Thanks.

> You still need CAP_NET_ADMIN for sock_hash which you're using.
> Are you saying it's part of the different process that has that cap_net_admin
> and nginx will be fine with cap_bpf + cap_net_bind_service ?

At this moment good old SOCKARRAY is sufficient. Having both SOCKARRAY
and SK_REUSEPORT_EBPF depend only on CAP_BPF is a good start. Thanks
for considering that. We can discuss relaxation of SOCKMAP in the
future.

Marek