mbox series

[net-next,0/4] Add getsockopt(SO_PEERCGROUPID) and fdinfo API to retreive socket's peer cgroup id

Message ID 20250309132821.103046-1-aleksandr.mikhalitsyn@canonical.com (mailing list archive)
Headers show
Series Add getsockopt(SO_PEERCGROUPID) and fdinfo API to retreive socket's peer cgroup id | expand

Message

Aleksandr Mikhalitsyn March 9, 2025, 1:28 p.m. UTC
1. Add socket cgroup id and socket's peer cgroup id in socket's fdinfo
2. Add SO_PEERCGROUPID which allows to retrieve socket's peer cgroup id
3. Add SO_PEERCGROUPID kselftest

Generally speaking, this API allows race-free resolution of socket's peer cgroup id.
Currently, to do that SCM_CREDENTIALS/SCM_PIDFD -> pid -> /proc/<pid>/cgroup sequence
is used which is racy.

As we don't add any new state to the socket itself there is no potential locking issues
or performance problems. We use already existing sk->sk_cgrp_data.

We already have analogical interfaces to retrieve this
information:
- inet_diag: INET_DIAG_CGROUP_ID
- eBPF: bpf_sk_cgroup_id

Having getsockopt() interface makes sense for many applications, because using eBPF is
not always an option, while inet_diag has obvious complexety and performance drawbacks
if we only want to get this specific info for one specific socket.

Idea comes from UAPI kernel group:
https://uapi-group.org/kernel-features/

Huge thanks to Christian Brauner, Lennart Poettering and Luca Boccassi for proposing
and exchanging ideas about this.

Git tree:
https://github.com/mihalicyn/linux/tree/so_peercgroupid

Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Luca Boccassi <bluca@debian.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Shuah Khan <shuah@kernel.org>

Alexander Mikhalitsyn (4):
  net: unix: print cgroup_id and peer_cgroup_id in fdinfo
  net: core: add getsockopt SO_PEERCGROUPID
  tools/testing/selftests/cgroup/cgroup_util: add cg_get_id helper
  tools/testing/selftests/cgroup: add test for SO_PEERCGROUPID

 arch/alpha/include/uapi/asm/socket.h          |   2 +
 arch/mips/include/uapi/asm/socket.h           |   2 +
 arch/parisc/include/uapi/asm/socket.h         |   2 +
 arch/sparc/include/uapi/asm/socket.h          |   2 +
 include/uapi/asm-generic/socket.h             |   2 +
 net/core/sock.c                               |  17 +
 net/unix/af_unix.c                            |  84 +++++
 tools/include/uapi/asm-generic/socket.h       |   2 +
 tools/testing/selftests/cgroup/Makefile       |   2 +
 tools/testing/selftests/cgroup/cgroup_util.c  |  15 +
 tools/testing/selftests/cgroup/cgroup_util.h  |   2 +
 .../selftests/cgroup/test_so_peercgroupid.c   | 308 ++++++++++++++++++
 12 files changed, 440 insertions(+)
 create mode 100644 tools/testing/selftests/cgroup/test_so_peercgroupid.c

Comments

Tejun Heo March 9, 2025, 2:36 p.m. UTC | #1
On Sun, Mar 09, 2025 at 02:28:11PM +0100, Alexander Mikhalitsyn wrote:
> 1. Add socket cgroup id and socket's peer cgroup id in socket's fdinfo
> 2. Add SO_PEERCGROUPID which allows to retrieve socket's peer cgroup id
> 3. Add SO_PEERCGROUPID kselftest
> 
> Generally speaking, this API allows race-free resolution of socket's peer cgroup id.
> Currently, to do that SCM_CREDENTIALS/SCM_PIDFD -> pid -> /proc/<pid>/cgroup sequence
> is used which is racy.
> 
> As we don't add any new state to the socket itself there is no potential locking issues
> or performance problems. We use already existing sk->sk_cgrp_data.
> 
> We already have analogical interfaces to retrieve this
> information:
> - inet_diag: INET_DIAG_CGROUP_ID
> - eBPF: bpf_sk_cgroup_id
> 
> Having getsockopt() interface makes sense for many applications, because using eBPF is
> not always an option, while inet_diag has obvious complexety and performance drawbacks
> if we only want to get this specific info for one specific socket.
> 
> Idea comes from UAPI kernel group:
> https://uapi-group.org/kernel-features/
> 
> Huge thanks to Christian Brauner, Lennart Poettering and Luca Boccassi for proposing
> and exchanging ideas about this.
> 
> Git tree:
> https://github.com/mihalicyn/linux/tree/so_peercgroupid

From cgroup POV:

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.
Christian Brauner March 10, 2025, 8:52 a.m. UTC | #2
On Sun, Mar 09, 2025 at 02:28:11PM +0100, Alexander Mikhalitsyn wrote:
> 1. Add socket cgroup id and socket's peer cgroup id in socket's fdinfo
> 2. Add SO_PEERCGROUPID which allows to retrieve socket's peer cgroup id
> 3. Add SO_PEERCGROUPID kselftest
> 
> Generally speaking, this API allows race-free resolution of socket's peer cgroup id.
> Currently, to do that SCM_CREDENTIALS/SCM_PIDFD -> pid -> /proc/<pid>/cgroup sequence
> is used which is racy.
> 
> As we don't add any new state to the socket itself there is no potential locking issues
> or performance problems. We use already existing sk->sk_cgrp_data.
> 
> We already have analogical interfaces to retrieve this
> information:
> - inet_diag: INET_DIAG_CGROUP_ID
> - eBPF: bpf_sk_cgroup_id
> 
> Having getsockopt() interface makes sense for many applications, because using eBPF is
> not always an option, while inet_diag has obvious complexety and performance drawbacks
> if we only want to get this specific info for one specific socket.
> 
> Idea comes from UAPI kernel group:
> https://uapi-group.org/kernel-features/
> 
> Huge thanks to Christian Brauner, Lennart Poettering and Luca Boccassi for proposing
> and exchanging ideas about this.

Seems fine to me,
Reviewed-by: Christian Brauner <brauner@kernel.org>
Christian Brauner March 10, 2025, 11:27 a.m. UTC | #3
On Mon, Mar 10, 2025 at 09:52:31AM +0100, Christian Brauner wrote:
> On Sun, Mar 09, 2025 at 02:28:11PM +0100, Alexander Mikhalitsyn wrote:
> > 1. Add socket cgroup id and socket's peer cgroup id in socket's fdinfo
> > 2. Add SO_PEERCGROUPID which allows to retrieve socket's peer cgroup id
> > 3. Add SO_PEERCGROUPID kselftest
> > 
> > Generally speaking, this API allows race-free resolution of socket's peer cgroup id.
> > Currently, to do that SCM_CREDENTIALS/SCM_PIDFD -> pid -> /proc/<pid>/cgroup sequence
> > is used which is racy.
> > 
> > As we don't add any new state to the socket itself there is no potential locking issues
> > or performance problems. We use already existing sk->sk_cgrp_data.
> > 
> > We already have analogical interfaces to retrieve this
> > information:
> > - inet_diag: INET_DIAG_CGROUP_ID
> > - eBPF: bpf_sk_cgroup_id
> > 
> > Having getsockopt() interface makes sense for many applications, because using eBPF is
> > not always an option, while inet_diag has obvious complexety and performance drawbacks
> > if we only want to get this specific info for one specific socket.
> > 
> > Idea comes from UAPI kernel group:
> > https://uapi-group.org/kernel-features/
> > 
> > Huge thanks to Christian Brauner, Lennart Poettering and Luca Boccassi for proposing
> > and exchanging ideas about this.
> 
> Seems fine to me,
> Reviewed-by: Christian Brauner <brauner@kernel.org>

One wider conceptual comment.

Starting with v6.15 it is possible to retrieve exit information from
pidfds even after the task has been reaped. So if someone opens a pidfd
via pidfd_open() and that task gets reaped by the parent it is possible
to call PIDFD_INFO_EXIT and you can retrieve the exit status and the
cgroupid of the task that was reaped. That works even after all task
linkage has been removed from struct pid.

The system call api doesn't allow the creation of pidfds for reaped
processes. It wouldn't be possible as the pid number will have already
been released.

Both SO_PEERPIDFD and SO_PASSPIDFD also don't allow the creation of
pidfds for already reaped peers or senders.

But that doesn't have to be the case since we always have the struct pid
available. So it's entirely possible to hand out a pidfd to a reaped
process if it's guaranteed that exit information is available. If it's
not then this would be a bug.

The trick is that when a struct pid is stashed it needs to also allocate
a pidfd inode. That could simply be done by a helper get_pidfs_pid()
which takes a reference to the struct pid and ensures that space for
recording exit information is available.

With that done SO_PEERCGROUPID isn't needed per se as it will be
possible to get the cgroupid and exit status from the pidfd.

From a cursory look that should be possible to do without too much work.
I'm just pointing this out as an alternative.

There's one restriction that this would be subject to that
SO_PEERCGROUPID isn't. The SO_PEERCGROUPID is exposed for any process
whereas PIDFD_GET_INFO ioctls (that includes the PIDFD_INFO_EXIT) option
is only available for processes within the receivers pid namespace
hierarchy.

But in any case, enabling pidfds for such reaped processes might still
be useful since it would mean receivers could get exit information for
pidfds within their pid namespace hierarchy.
Kuniyuki Iwashima March 11, 2025, 7:33 a.m. UTC | #4
From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Date: Sun,  9 Mar 2025 14:28:11 +0100
> 1. Add socket cgroup id and socket's peer cgroup id in socket's fdinfo

Why do you want to add yet another racy interface ?


> 2. Add SO_PEERCGROUPID which allows to retrieve socket's peer cgroup id
> 3. Add SO_PEERCGROUPID kselftest
> 
> Generally speaking, this API allows race-free resolution of socket's peer cgroup id.
> Currently, to do that SCM_CREDENTIALS/SCM_PIDFD -> pid -> /proc/<pid>/cgroup sequence
> is used which is racy.

Few more words about the race (recycling pid ?) would be appreciated.

I somewhat assumed pid is not recycled until all of its pidfd are
close()d, but sounds like no ?


> 
> As we don't add any new state to the socket itself there is no potential locking issues
> or performance problems. We use already existing sk->sk_cgrp_data.
> 
> We already have analogical interfaces to retrieve this
> information:
> - inet_diag: INET_DIAG_CGROUP_ID
> - eBPF: bpf_sk_cgroup_id
> 
> Having getsockopt() interface makes sense for many applications, because using eBPF is
> not always an option, while inet_diag has obvious complexety and performance drawbacks
> if we only want to get this specific info for one specific socket.

If it's limited to the connect()ed peer, I'd add UNIX_DIAG_CGROUP_ID
and UNIX_DIAG_PEER_CGROUP_ID instead.  Then also ss can use that easily.
Christian Brauner March 11, 2025, 12:02 p.m. UTC | #5
On Tue, Mar 11, 2025 at 12:33:48AM -0700, Kuniyuki Iwashima wrote:
> From: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
> Date: Sun,  9 Mar 2025 14:28:11 +0100
> > 1. Add socket cgroup id and socket's peer cgroup id in socket's fdinfo
> 
> Why do you want to add yet another racy interface ?
> 
> 
> > 2. Add SO_PEERCGROUPID which allows to retrieve socket's peer cgroup id
> > 3. Add SO_PEERCGROUPID kselftest
> > 
> > Generally speaking, this API allows race-free resolution of socket's peer cgroup id.
> > Currently, to do that SCM_CREDENTIALS/SCM_PIDFD -> pid -> /proc/<pid>/cgroup sequence
> > is used which is racy.
> 
> Few more words about the race (recycling pid ?) would be appreciated.
> 
> I somewhat assumed pid is not recycled until all of its pidfd are
> close()d, but sounds like no ?

No, that would allow starving the kernel of pid numbers.
pidfds don't pin struct task_struct for a multitude of reasons similar
to how cred->peer or scm->pid don't stash a task_struct but a struct pid.

> 
> 
> > 
> > As we don't add any new state to the socket itself there is no potential locking issues
> > or performance problems. We use already existing sk->sk_cgrp_data.
> > 
> > We already have analogical interfaces to retrieve this
> > information:
> > - inet_diag: INET_DIAG_CGROUP_ID
> > - eBPF: bpf_sk_cgroup_id
> > 
> > Having getsockopt() interface makes sense for many applications, because using eBPF is
> > not always an option, while inet_diag has obvious complexety and performance drawbacks
> > if we only want to get this specific info for one specific socket.
> 
> If it's limited to the connect()ed peer, I'd add UNIX_DIAG_CGROUP_ID
> and UNIX_DIAG_PEER_CGROUP_ID instead.  Then also ss can use that easily.
Michal Koutný March 13, 2025, 10:31 a.m. UTC | #6
Hello.

On Sun, Mar 09, 2025 at 02:28:11PM +0100, Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> wrote:
> As we don't add any new state to the socket itself there is no potential locking issues
> or performance problems. We use already existing sk->sk_cgrp_data.

This is the cgroup where the socket was created in. If such a socket is
fd-passed to another cgroup, SO_PEERCGROUPID may not be what's expected.

> We already have analogical interfaces to retrieve this
> information:
> - inet_diag: INET_DIAG_CGROUP_ID
> - eBPF: bpf_sk_cgroup_id
> 
> Having getsockopt() interface makes sense for many applications, because using eBPF is
> not always an option, while inet_diag has obvious complexety and performance drawbacks
> if we only want to get this specific info for one specific socket.

I'm not that familiar with INET_DIAG_CGROUP_ID but that one sounds like
fit for the purpose of obtaining socket creator's cgroup whereas what is
desired here is slightly different thing -- cgroup of actual sender
through the socket.

> Idea comes from UAPI kernel group:
> https://uapi-group.org/kernel-features/

In theory shortlived (sending) program may reside in shortlived cgroup
and the consumer of SO_PEERCGROUPID (even if it is real sender) would
only have that number to work with.
It doesn't guarantee existence of original cgroup or stable translation
to cgroup path. I think having something like this could be useful but
consumers must still be aware of limitations (and possible switch from
path-oriented to id-oriented work with cgroups).

Thanks,
Michal