diff mbox series

[bpf-next,v1,3/5] bpf: Introduce cgroup iter

Message ID 20220520012133.1217211-4-yosryahmed@google.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series bpf: rstat: cgroup hierarchical stats | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for Kernel LATEST on z15 with gcc
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 1833 this patch: 1674
netdev/cc_maintainers success CCed 10 of 10 maintainers
netdev/build_clang fail Errors and warnings before: 184 this patch: 190
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 1684 this patch: 1684
netdev/checkpatch warning WARNING: Possible unnecessary 'out of memory' message WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-1 fail Logs for Kernel LATEST on ubuntu-latest with gcc
bpf/vmtest-bpf-next-VM_Test-2 fail Logs for Kernel LATEST on ubuntu-latest with llvm-15

Commit Message

Yosry Ahmed May 20, 2022, 1:21 a.m. UTC
From: Hao Luo <haoluo@google.com>

Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
iter doesn't iterate a set of kernel objects. Instead, it is supposed to
be parameterized by a cgroup id and prints only that cgroup. So one
needs to specify a target cgroup id when attaching this iter. The target
cgroup's state can be read out via a link of this iter.

Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/bpf.h            |   2 +
 include/uapi/linux/bpf.h       |   6 ++
 kernel/bpf/Makefile            |   3 +
 kernel/bpf/cgroup_iter.c       | 148 +++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |   6 ++
 5 files changed, 165 insertions(+)
 create mode 100644 kernel/bpf/cgroup_iter.c

Comments

Tejun Heo May 20, 2022, 7:41 a.m. UTC | #1
On Fri, May 20, 2022 at 01:21:31AM +0000, Yosry Ahmed wrote:
> From: Hao Luo <haoluo@google.com>
> 
> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> be parameterized by a cgroup id and prints only that cgroup. So one
> needs to specify a target cgroup id when attaching this iter. The target
> cgroup's state can be read out via a link of this iter.
> 
> Signed-off-by: Hao Luo <haoluo@google.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

This could be me not understanding why it's structured this way but it keeps
bothering me that this is adding a cgroup iterator which doesn't iterate
cgroups. If all that's needed is extracting information from a specific
cgroup, why does this need to be an iterator? e.g. why can't I use
BPF_PROG_TEST_RUN which looks up the cgroup with the provided ID, flushes
rstat, retrieves whatever information necessary and returns that as the
result?

Thanks.
Yosry Ahmed May 20, 2022, 7:58 a.m. UTC | #2
On Fri, May 20, 2022 at 12:41 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Fri, May 20, 2022 at 01:21:31AM +0000, Yosry Ahmed wrote:
> > From: Hao Luo <haoluo@google.com>
> >
> > Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> > iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> > be parameterized by a cgroup id and prints only that cgroup. So one
> > needs to specify a target cgroup id when attaching this iter. The target
> > cgroup's state can be read out via a link of this iter.
> >
> > Signed-off-by: Hao Luo <haoluo@google.com>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> This could be me not understanding why it's structured this way but it keeps
> bothering me that this is adding a cgroup iterator which doesn't iterate
> cgroups. If all that's needed is extracting information from a specific
> cgroup, why does this need to be an iterator? e.g. why can't I use
> BPF_PROG_TEST_RUN which looks up the cgroup with the provided ID, flushes
> rstat, retrieves whatever information necessary and returns that as the
> result?

I will let Hao and Yonghong reply here as they have a lot more
context, and they had previous discussions about cgroup_iter. I just
want to say that exposing the stats in a file is extremely convenient
for userspace apps. It becomes very similar to reading stats from
cgroupfs. It also makes migrating cgroup stats that we have
implemented in the kernel to BPF a lot easier.

AFAIK there are also discussions about using overlayfs to have links
to the bpffs files in cgroupfs, which makes it even better. So I would
really prefer keeping the approach we have here of reading stats
through a file from userspace. As for how we go about this (and why a
cgroup iterator doesn't iterate cgroups) I will leave this for Hao and
Yonghong to explain the rationale behind it. Ideally we can keep the
same functionality under a more descriptive name/type.

>
> Thanks.
>
> --
> tejun
Tejun Heo May 20, 2022, 8:11 a.m. UTC | #3
Hello,

On Fri, May 20, 2022 at 12:58:52AM -0700, Yosry Ahmed wrote:
> On Fri, May 20, 2022 at 12:41 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > On Fri, May 20, 2022 at 01:21:31AM +0000, Yosry Ahmed wrote:
> > > From: Hao Luo <haoluo@google.com>
> > >
> > > Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> > > iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> > > be parameterized by a cgroup id and prints only that cgroup. So one
> > > needs to specify a target cgroup id when attaching this iter. The target
> > > cgroup's state can be read out via a link of this iter.
> > >
> > > Signed-off-by: Hao Luo <haoluo@google.com>
> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> >
> > This could be me not understanding why it's structured this way but it keeps
> > bothering me that this is adding a cgroup iterator which doesn't iterate
> > cgroups. If all that's needed is extracting information from a specific
> > cgroup, why does this need to be an iterator? e.g. why can't I use
> > BPF_PROG_TEST_RUN which looks up the cgroup with the provided ID, flushes
> > rstat, retrieves whatever information necessary and returns that as the
> > result?
> 
> I will let Hao and Yonghong reply here as they have a lot more
> context, and they had previous discussions about cgroup_iter. I just
> want to say that exposing the stats in a file is extremely convenient
> for userspace apps. It becomes very similar to reading stats from
> cgroupfs. It also makes migrating cgroup stats that we have
> implemented in the kernel to BPF a lot easier.

So, if it were upto me, I'd rather direct energy towards making retrieving
information through TEST_RUN_PROG easier rather than clinging to making
kernel output text. I get that text interface is familiar but it kinda
sucks in many ways.

> AFAIK there are also discussions about using overlayfs to have links
> to the bpffs files in cgroupfs, which makes it even better. So I would
> really prefer keeping the approach we have here of reading stats
> through a file from userspace. As for how we go about this (and why a
> cgroup iterator doesn't iterate cgroups) I will leave this for Hao and
> Yonghong to explain the rationale behind it. Ideally we can keep the
> same functionality under a more descriptive name/type.

My answer would be the same here. You guys seem dead set on making the
kernel emulate cgroup1. I'm not gonna explicitly block that but would
strongly suggest having a longer term view.

If you *must* do the iterator, can you at least make it a proper iterator
which supports seeking? AFAICS there's nothing fundamentally preventing bpf
iterators from supporting seeking. Or is it that you need something which is
pinned to a cgroup so that you can emulate the directory structure?

Thanks.
Tejun Heo May 20, 2022, 11:27 a.m. UTC | #4
On Thu, May 19, 2022 at 10:11:26PM -1000, Tejun Heo wrote:
> If you *must* do the iterator, can you at least make it a proper iterator
> which supports seeking? AFAICS there's nothing fundamentally preventing bpf
> iterators from supporting seeking. Or is it that you need something which is
> pinned to a cgroup so that you can emulate the directory structure?

Or, alternatively, would it be possible to make a TEST_RUN_PROG to output a
text file in bpffs? There just doesn't seem to be anything cgroup specific
that the iterator is doing that can't be done with exposing a couple kfuncs.

Thanks.
Yonghong Song May 20, 2022, 4:29 p.m. UTC | #5
On 5/20/22 1:11 AM, Tejun Heo wrote:
> Hello,
> 
> On Fri, May 20, 2022 at 12:58:52AM -0700, Yosry Ahmed wrote:
>> On Fri, May 20, 2022 at 12:41 AM Tejun Heo <tj@kernel.org> wrote:
>>>
>>> On Fri, May 20, 2022 at 01:21:31AM +0000, Yosry Ahmed wrote:
>>>> From: Hao Luo <haoluo@google.com>
>>>>
>>>> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
>>>> iter doesn't iterate a set of kernel objects. Instead, it is supposed to
>>>> be parameterized by a cgroup id and prints only that cgroup. So one
>>>> needs to specify a target cgroup id when attaching this iter. The target
>>>> cgroup's state can be read out via a link of this iter.
>>>>
>>>> Signed-off-by: Hao Luo <haoluo@google.com>
>>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>>>
>>> This could be me not understanding why it's structured this way but it keeps
>>> bothering me that this is adding a cgroup iterator which doesn't iterate
>>> cgroups. If all that's needed is extracting information from a specific
>>> cgroup, why does this need to be an iterator? e.g. why can't I use
>>> BPF_PROG_TEST_RUN which looks up the cgroup with the provided ID, flushes
>>> rstat, retrieves whatever information necessary and returns that as the
>>> result?
>>
>> I will let Hao and Yonghong reply here as they have a lot more
>> context, and they had previous discussions about cgroup_iter. I just
>> want to say that exposing the stats in a file is extremely convenient
>> for userspace apps. It becomes very similar to reading stats from
>> cgroupfs. It also makes migrating cgroup stats that we have
>> implemented in the kernel to BPF a lot easier.
> 
> So, if it were upto me, I'd rather direct energy towards making retrieving
> information through TEST_RUN_PROG easier rather than clinging to making
> kernel output text. I get that text interface is familiar but it kinda
> sucks in many ways.
> 
>> AFAIK there are also discussions about using overlayfs to have links
>> to the bpffs files in cgroupfs, which makes it even better. So I would
>> really prefer keeping the approach we have here of reading stats
>> through a file from userspace. As for how we go about this (and why a
>> cgroup iterator doesn't iterate cgroups) I will leave this for Hao and
>> Yonghong to explain the rationale behind it. Ideally we can keep the
>> same functionality under a more descriptive name/type.
> 
> My answer would be the same here. You guys seem dead set on making the
> kernel emulate cgroup1. I'm not gonna explicitly block that but would
> strongly suggest having a longer term view.
> 
> If you *must* do the iterator, can you at least make it a proper iterator
> which supports seeking? AFAICS there's nothing fundamentally preventing bpf
> iterators from supporting seeking. Or is it that you need something which is
> pinned to a cgroup so that you can emulate the directory structure?

The current bpf_iter for cgroup is for the google use case
per previous discussion. But I think a generic cgroup bpf iterator
should help as well.

Maybe you can have a bpf program signature like below:

int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup 
*cgrp, struct cgroup *parent_cgrp)

parent_cgrp is NULL when cgrp is the root cgroup.

I would like the bpf program should send the following information to
user space:
    <parent cgroup dir name> <current cgroup dir name>
    <various stats interested by the user>

This way, user space can easily construct the cgroup hierarchy stat like
                            cpu   mem   cpu pressure   mem pressure ...
    cgroup1                 ...
       child1               ...
         grandchild1        ...
       child2               ...
    cgroup 2                ...
       child 3              ...
         ...                ...

the bpf iterator can have additional parameter like
cgroup_id = ... to only call bpf program once with that
cgroup_id if specified.

The kernel part of cgroup_iter can call cgroup_rstat_flush()
before calling cgroup_iter bpf program.

WDYT?

> 
> Thanks.
>
Tejun Heo May 20, 2022, 4:45 p.m. UTC | #6
Hello, Yonghong.

On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote:
> Maybe you can have a bpf program signature like below:
> 
> int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp,
> struct cgroup *parent_cgrp)
> 
> parent_cgrp is NULL when cgrp is the root cgroup.
> 
> I would like the bpf program should send the following information to
> user space:
>    <parent cgroup dir name> <current cgroup dir name>

I don't think parent cgroup dir name would be sufficient to reconstruct the
path given that multiple cgroups in different subtrees can have the same
name. For live cgroups, userspace can find the path from id (or ino) without
traversing anything by constructing the fhandle, open it open_by_handle_at()
and then reading /proc/self/fd/$FD symlink -
https://lkml.org/lkml/2020/12/2/1126. This isn't available for dead cgroups
but I'm not sure how much that'd matter given that they aren't visible from
userspace anyway.

>    <various stats interested by the user>
> 
> This way, user space can easily construct the cgroup hierarchy stat like
>                            cpu   mem   cpu pressure   mem pressure ...
>    cgroup1                 ...
>       child1               ...
>         grandchild1        ...
>       child2               ...
>    cgroup 2                ...
>       child 3              ...
>         ...                ...
> 
> the bpf iterator can have additional parameter like
> cgroup_id = ... to only call bpf program once with that
> cgroup_id if specified.
> 
> The kernel part of cgroup_iter can call cgroup_rstat_flush()
> before calling cgroup_iter bpf program.
> 
> WDYT?

Would it work to just pass in @cgrp and provide a group of helpers so that
the program can do whatever it wanna do including looking up the full path
and passing that to userspace?

Thanks.
Hao Luo May 20, 2022, 5:30 p.m. UTC | #7
Hi Tejun,

On Fri, May 20, 2022 at 1:11 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, May 20, 2022 at 12:58:52AM -0700, Yosry Ahmed wrote:
> > On Fri, May 20, 2022 at 12:41 AM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > On Fri, May 20, 2022 at 01:21:31AM +0000, Yosry Ahmed wrote:
> > > > From: Hao Luo <haoluo@google.com>
> > > >
> > > > Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> > > > iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> > > > be parameterized by a cgroup id and prints only that cgroup. So one
> > > > needs to specify a target cgroup id when attaching this iter. The target
> > > > cgroup's state can be read out via a link of this iter.
> > > >
> > > > Signed-off-by: Hao Luo <haoluo@google.com>
> > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > >
> > > This could be me not understanding why it's structured this way but it keeps
> > > bothering me that this is adding a cgroup iterator which doesn't iterate
> > > cgroups. If all that's needed is extracting information from a specific
> > > cgroup, why does this need to be an iterator? e.g. why can't I use
> > > BPF_PROG_TEST_RUN which looks up the cgroup with the provided ID, flushes
> > > rstat, retrieves whatever information necessary and returns that as the
> > > result?
> >
> > I will let Hao and Yonghong reply here as they have a lot more
> > context, and they had previous discussions about cgroup_iter. I just
> > want to say that exposing the stats in a file is extremely convenient
> > for userspace apps. It becomes very similar to reading stats from
> > cgroupfs. It also makes migrating cgroup stats that we have
> > implemented in the kernel to BPF a lot easier.
>
> So, if it were upto me, I'd rather direct energy towards making retrieving
> information through TEST_RUN_PROG easier rather than clinging to making
> kernel output text. I get that text interface is familiar but it kinda
> sucks in many ways.
>

Tejun, could you explain more about the downside of text interfaces
and why TEST_RUN_PROG would address the problems in text output? From
the discussion we had last time, I understand that your concern was
the unstable interface if we introduce bpf files in cgroupfs, so we
are moving toward replicating the directory structure in bpffs. But I
am not sure about the issue of text format output

> > AFAIK there are also discussions about using overlayfs to have links
> > to the bpffs files in cgroupfs, which makes it even better. So I would
> > really prefer keeping the approach we have here of reading stats
> > through a file from userspace. As for how we go about this (and why a
> > cgroup iterator doesn't iterate cgroups) I will leave this for Hao and
> > Yonghong to explain the rationale behind it. Ideally we can keep the
> > same functionality under a more descriptive name/type.
>
> My answer would be the same here. You guys seem dead set on making the
> kernel emulate cgroup1. I'm not gonna explicitly block that but would
> strongly suggest having a longer term view.
>

The reason why Yosry and I are still pushing toward this direction is
that our user space applications rely heavily on extracting
information from text output for cgroups. Please understand that
migrating them from the traditional model to a new model is a bigger
pain. But I agree that if we have a better, concrete solution (for
example, maybe TEST_RUN_PROG) to convince them and help them migrate,
I really would love to contribute and work on it.

> If you *must* do the iterator, can you at least make it a proper iterator
> which supports seeking? AFAICS there's nothing fundamentally preventing bpf
> iterators from supporting seeking. Or is it that you need something which is
> pinned to a cgroup so that you can emulate the directory structure?
>

Yonghong may comment on adding seek for bpf_iter. I would love to
contribute if we are in need of that. Right now, we don't have a use
case that needs seek for bpf_iter, I think. My thought: for cgroups,
we can seek using cgroup id. Maybe, not all kernel objects are
indexable, so seeking doesn't apply there?

Hao
Hao Luo May 20, 2022, 7:42 p.m. UTC | #8
Hi Tejun and Yonghong,

On Fri, May 20, 2022 at 9:45 AM Tejun Heo <tj@kernel.org> wrote:
> On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote:
> > Maybe you can have a bpf program signature like below:
> >
> > int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp,
> > struct cgroup *parent_cgrp)
> >
> > parent_cgrp is NULL when cgrp is the root cgroup.
> >
> > I would like the bpf program should send the following information to
> > user space:
> >    <parent cgroup dir name> <current cgroup dir name>
>
> I don't think parent cgroup dir name would be sufficient to reconstruct the
> path given that multiple cgroups in different subtrees can have the same
> name. For live cgroups, userspace can find the path from id (or ino) without
> traversing anything by constructing the fhandle, open it open_by_handle_at()
> and then reading /proc/self/fd/$FD symlink -
> https://lkml.org/lkml/2020/12/2/1126. This isn't available for dead cgroups
> but I'm not sure how much that'd matter given that they aren't visible from
> userspace anyway.
>

Sending cgroup id is better than cgroup dir name, also because IIUC
the path obtained from cgroup id depends on the namespace of the
userspace process. So if the dump file may be potentially read by
processes within a container, it's better to have the output
namespaced IMO.

> >    <various stats interested by the user>
> >
> > This way, user space can easily construct the cgroup hierarchy stat like
> >                            cpu   mem   cpu pressure   mem pressure ...
> >    cgroup1                 ...
> >       child1               ...
> >         grandchild1        ...
> >       child2               ...
> >    cgroup 2                ...
> >       child 3              ...
> >         ...                ...
> >
> > the bpf iterator can have additional parameter like
> > cgroup_id = ... to only call bpf program once with that
> > cgroup_id if specified.

Yep, this should work. We just need to make the cgroup_id parameter
optional. If it is specified when creating bpf_iter_link, we print for
that cgroup only. If it is not specified, we iterate over all cgroups.
If I understand correctly, sounds doable.

> > The kernel part of cgroup_iter can call cgroup_rstat_flush()
> > before calling cgroup_iter bpf program.

Sounds good to me as well. But my knowledge on rstat_flush is limited.
Yosry can give this a try.

>
> Would it work to just pass in @cgrp and provide a group of helpers so that
> the program can do whatever it wanna do including looking up the full path
> and passing that to userspace?
>

My understanding is, yes, doable. If we need the full path information
of a cgroup, helpers or kfuncs are needed.

The userspace needs to specify the identity of the cgroup, when
creating bpf_iter. This identity could be cgroup id or fd. This
identity needs to be converted to cgroup object somewhere before
passing into bpf program to use.
Yosry Ahmed May 20, 2022, 9:18 p.m. UTC | #9
On Fri, May 20, 2022 at 12:43 PM Hao Luo <haoluo@google.com> wrote:
>
> Hi Tejun and Yonghong,
>
> On Fri, May 20, 2022 at 9:45 AM Tejun Heo <tj@kernel.org> wrote:
> > On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote:
> > > Maybe you can have a bpf program signature like below:
> > >
> > > int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp,
> > > struct cgroup *parent_cgrp)
> > >
> > > parent_cgrp is NULL when cgrp is the root cgroup.
> > >
> > > I would like the bpf program should send the following information to
> > > user space:
> > >    <parent cgroup dir name> <current cgroup dir name>
> >
> > I don't think parent cgroup dir name would be sufficient to reconstruct the
> > path given that multiple cgroups in different subtrees can have the same
> > name. For live cgroups, userspace can find the path from id (or ino) without
> > traversing anything by constructing the fhandle, open it open_by_handle_at()
> > and then reading /proc/self/fd/$FD symlink -
> > https://lkml.org/lkml/2020/12/2/1126. This isn't available for dead cgroups
> > but I'm not sure how much that'd matter given that they aren't visible from
> > userspace anyway.
> >
>
> Sending cgroup id is better than cgroup dir name, also because IIUC
> the path obtained from cgroup id depends on the namespace of the
> userspace process. So if the dump file may be potentially read by
> processes within a container, it's better to have the output
> namespaced IMO.
>
> > >    <various stats interested by the user>
> > >
> > > This way, user space can easily construct the cgroup hierarchy stat like
> > >                            cpu   mem   cpu pressure   mem pressure ...
> > >    cgroup1                 ...
> > >       child1               ...
> > >         grandchild1        ...
> > >       child2               ...
> > >    cgroup 2                ...
> > >       child 3              ...
> > >         ...                ...
> > >
> > > the bpf iterator can have additional parameter like
> > > cgroup_id = ... to only call bpf program once with that
> > > cgroup_id if specified.
>
> Yep, this should work. We just need to make the cgroup_id parameter
> optional. If it is specified when creating bpf_iter_link, we print for
> that cgroup only. If it is not specified, we iterate over all cgroups.
> If I understand correctly, sounds doable.
>
> > > The kernel part of cgroup_iter can call cgroup_rstat_flush()
> > > before calling cgroup_iter bpf program.
>
> Sounds good to me as well. But my knowledge on rstat_flush is limited.
> Yosry can give this a try.
>
> >
> > Would it work to just pass in @cgrp and provide a group of helpers so that
> > the program can do whatever it wanna do including looking up the full path
> > and passing that to userspace?
> >
>
> My understanding is, yes, doable. If we need the full path information
> of a cgroup, helpers or kfuncs are needed.
>
> The userspace needs to specify the identity of the cgroup, when
> creating bpf_iter. This identity could be cgroup id or fd. This
> identity needs to be converted to cgroup object somewhere before
> passing into bpf program to use.


Let's sum up the discussion here, I feel like we are losing track of
the main problem. IIUC the main concern is that cgroup_iter is not
effectively an iterator, it rather dumps information for one cgroup. I
like the suggestion to make it iterate cgroups by default, and an
optional cgroup_id parameter to make it only "iterate" this one
cgroup. IIUC, this cgroup_id parameter would be a link parameter,
similar to the current approach. Basically, we extend the current
patch so that if cgroup_id is not specified the iterator gets called
for all cgroups instead of one. This fixes the problem for our use
case and also keeps cgroup_iter generic enough. Is my understanding
correct? If yes, I don't see a need to flush rstat in the kernel on
behalf of cgroup_iter progs.
Hao Luo May 20, 2022, 9:49 p.m. UTC | #10
Hi Tejun and Yonghong,

On Fri, May 20, 2022 at 12:42 PM Hao Luo <haoluo@google.com> wrote:
>
> Hi Tejun and Yonghong,
>
> On Fri, May 20, 2022 at 9:45 AM Tejun Heo <tj@kernel.org> wrote:
> > On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote:
> > >    <various stats interested by the user>
> > >
> > > This way, user space can easily construct the cgroup hierarchy stat like
> > >                            cpu   mem   cpu pressure   mem pressure ...
> > >    cgroup1                 ...
> > >       child1               ...
> > >         grandchild1        ...
> > >       child2               ...
> > >    cgroup 2                ...
> > >       child 3              ...
> > >         ...                ...
> > >
> > > the bpf iterator can have additional parameter like
> > > cgroup_id = ... to only call bpf program once with that
> > > cgroup_id if specified.
>
> Yep, this should work. We just need to make the cgroup_id parameter
> optional. If it is specified when creating bpf_iter_link, we print for
> that cgroup only. If it is not specified, we iterate over all cgroups.
> If I understand correctly, sounds doable.
>

Yonghong, I realized that seek() which Tejun has been calling out, can
be used to specify the target cgroup, rather than adding a new
parameter. Maybe, we can pass cgroup_id to seek() on cgroup bpf_iter,
which will instruct read() to return the corresponding cgroup's stats.
On the other hand, reading without calling seek() beforehand will
return all the cgroups.

WDYT?
Alexei Starovoitov May 20, 2022, 10:19 p.m. UTC | #11
On Fri, May 20, 2022 at 02:18:42PM -0700, Yosry Ahmed wrote:
> >
> > The userspace needs to specify the identity of the cgroup, when
> > creating bpf_iter. This identity could be cgroup id or fd. This
> > identity needs to be converted to cgroup object somewhere before
> > passing into bpf program to use.
> 
> 
> Let's sum up the discussion here, I feel like we are losing track of
> the main problem. IIUC the main concern is that cgroup_iter is not
> effectively an iterator, it rather dumps information for one cgroup. I
> like the suggestion to make it iterate cgroups by default, and an
> optional cgroup_id parameter to make it only "iterate" this one
> cgroup.

We have bpf_map iterator that walks all bpf maps.
When map iterator is parametrized with map_fd the iterator walks
all elements of that map.
cgroup iterator should have similar semantics.
When non-parameterized it will walk all cgroups and their descendent
depth first way. I believe that's what Yonghong is proposing.
When parametrized it will start from that particular cgroup and
walk all descendant of that cgroup only.
The bpf prog can stop the iteration right away with ret 1.
Maybe we can add two parameters. One -> cgroup_fd to use and another ->
the order of iteration css_for_each_descendant_pre vs _post.
wdyt?
Yosry Ahmed May 20, 2022, 10:36 p.m. UTC | #12
On Fri, May 20, 2022 at 3:19 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, May 20, 2022 at 02:18:42PM -0700, Yosry Ahmed wrote:
> > >
> > > The userspace needs to specify the identity of the cgroup, when
> > > creating bpf_iter. This identity could be cgroup id or fd. This
> > > identity needs to be converted to cgroup object somewhere before
> > > passing into bpf program to use.
> >
> >
> > Let's sum up the discussion here, I feel like we are losing track of
> > the main problem. IIUC the main concern is that cgroup_iter is not
> > effectively an iterator, it rather dumps information for one cgroup. I
> > like the suggestion to make it iterate cgroups by default, and an
> > optional cgroup_id parameter to make it only "iterate" this one
> > cgroup.
>
> We have bpf_map iterator that walks all bpf maps.
> When map iterator is parametrized with map_fd the iterator walks
> all elements of that map.
> cgroup iterator should have similar semantics.
> When non-parameterized it will walk all cgroups and their descendent
> depth first way. I believe that's what Yonghong is proposing.
> When parametrized it will start from that particular cgroup and
> walk all descendant of that cgroup only.
> The bpf prog can stop the iteration right away with ret 1.
> Maybe we can add two parameters. One -> cgroup_fd to use and another ->
> the order of iteration css_for_each_descendant_pre vs _post.
> wdyt?

So basically extend the current patch so that cgroup_id (or cgroup_fd)
is optional, and it specifies where the iteration starts. If not
provided, then we start at root. For our use case where we want the
iterator to only be invoked for one cgroup we make it return 1 to stop
after the first iteration.

I assume an order parameter is also needed to specify "pre" for our
use case to make sure we are starting iteration at the top cgroup (the
one whose cgroup_id is the parameter of the iterator).

Is my understanding correct? If yes, then this sounds very good. It is
generic enough, actually iterates cgroups, and works for our use case.
Tejun Heo May 20, 2022, 10:57 p.m. UTC | #13
Hello,

On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote:
> We have bpf_map iterator that walks all bpf maps.
> When map iterator is parametrized with map_fd the iterator walks
> all elements of that map.
> cgroup iterator should have similar semantics.
> When non-parameterized it will walk all cgroups and their descendent
> depth first way. I believe that's what Yonghong is proposing.
> When parametrized it will start from that particular cgroup and
> walk all descendant of that cgroup only.
> The bpf prog can stop the iteration right away with ret 1.
> Maybe we can add two parameters. One -> cgroup_fd to use and another ->
> the order of iteration css_for_each_descendant_pre vs _post.
> wdyt?

Sounds perfectly reasonable to me.

Thanks.
Yonghong Song May 21, 2022, 12:52 a.m. UTC | #14
On 5/20/22 9:45 AM, Tejun Heo wrote:
> Hello, Yonghong.
> 
> On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote:
>> Maybe you can have a bpf program signature like below:
>>
>> int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp,
>> struct cgroup *parent_cgrp)
>>
>> parent_cgrp is NULL when cgrp is the root cgroup.
>>
>> I would like the bpf program should send the following information to
>> user space:
>>     <parent cgroup dir name> <current cgroup dir name>
> 
> I don't think parent cgroup dir name would be sufficient to reconstruct the
> path given that multiple cgroups in different subtrees can have the same
> name. For live cgroups, userspace can find the path from id (or ino) without
> traversing anything by constructing the fhandle, open it open_by_handle_at()
> and then reading /proc/self/fd/$FD symlink -
> https://lkml.org/lkml/2020/12/2/1126. This isn't available for dead cgroups
> but I'm not sure how much that'd matter given that they aren't visible from
> userspace anyway.

passing id/ino to user space and then get directory name in userspace
should work just fine.

> 
>>     <various stats interested by the user>
>>
>> This way, user space can easily construct the cgroup hierarchy stat like
>>                             cpu   mem   cpu pressure   mem pressure ...
>>     cgroup1                 ...
>>        child1               ...
>>          grandchild1        ...
>>        child2               ...
>>     cgroup 2                ...
>>        child 3              ...
>>          ...                ...
>>
>> the bpf iterator can have additional parameter like
>> cgroup_id = ... to only call bpf program once with that
>> cgroup_id if specified.
>>
>> The kernel part of cgroup_iter can call cgroup_rstat_flush()
>> before calling cgroup_iter bpf program.
>>
>> WDYT?
> 
> Would it work to just pass in @cgrp and provide a group of helpers so that
> the program can do whatever it wanna do including looking up the full path
> and passing that to userspace?

I am not super familiar with cgroup internals, I guess with cgroup + 
helpers to retrieve stats, or directly expose stats data structure
to bpf program. Either one is okay to me as long as we can get
desired results.

> 
> Thanks.
>
Yonghong Song May 21, 2022, 12:58 a.m. UTC | #15
On 5/20/22 2:49 PM, Hao Luo wrote:
> Hi Tejun and Yonghong,
> 
> On Fri, May 20, 2022 at 12:42 PM Hao Luo <haoluo@google.com> wrote:
>>
>> Hi Tejun and Yonghong,
>>
>> On Fri, May 20, 2022 at 9:45 AM Tejun Heo <tj@kernel.org> wrote:
>>> On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote:
>>>>     <various stats interested by the user>
>>>>
>>>> This way, user space can easily construct the cgroup hierarchy stat like
>>>>                             cpu   mem   cpu pressure   mem pressure ...
>>>>     cgroup1                 ...
>>>>        child1               ...
>>>>          grandchild1        ...
>>>>        child2               ...
>>>>     cgroup 2                ...
>>>>        child 3              ...
>>>>          ...                ...
>>>>
>>>> the bpf iterator can have additional parameter like
>>>> cgroup_id = ... to only call bpf program once with that
>>>> cgroup_id if specified.
>>
>> Yep, this should work. We just need to make the cgroup_id parameter
>> optional. If it is specified when creating bpf_iter_link, we print for
>> that cgroup only. If it is not specified, we iterate over all cgroups.
>> If I understand correctly, sounds doable.
>>
> 
> Yonghong, I realized that seek() which Tejun has been calling out, can
> be used to specify the target cgroup, rather than adding a new
> parameter. Maybe, we can pass cgroup_id to seek() on cgroup bpf_iter,
> which will instruct read() to return the corresponding cgroup's stats.
> On the other hand, reading without calling seek() beforehand will
> return all the cgroups.

Currently, seek is not supported for bpf_iter.

const struct file_operations bpf_iter_fops = {
         .open           = iter_open,
         .llseek         = no_llseek,
         .read           = bpf_seq_read,
         .release        = iter_release,
};

But if seek() works, I don't mind to remove this restriction.
But not sure what to seek. Do you mean to provide a cgroup_fd/cgroup_id
as the seek() syscall parameter? This may work.

But considering we have parameterized example (map_fd) and
in the future, we may have other parameterized bpf_iter
(e.g., for one task). Maybe parameter-based approach is better.

> 
> WDYT?
Yonghong Song May 21, 2022, 12:59 a.m. UTC | #16
On 5/20/22 3:57 PM, Tejun Heo wrote:
> Hello,
> 
> On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote:
>> We have bpf_map iterator that walks all bpf maps.
>> When map iterator is parametrized with map_fd the iterator walks
>> all elements of that map.
>> cgroup iterator should have similar semantics.
>> When non-parameterized it will walk all cgroups and their descendent
>> depth first way. I believe that's what Yonghong is proposing.
>> When parametrized it will start from that particular cgroup and
>> walk all descendant of that cgroup only.
>> The bpf prog can stop the iteration right away with ret 1.
>> Maybe we can add two parameters. One -> cgroup_fd to use and another ->
>> the order of iteration css_for_each_descendant_pre vs _post.
>> wdyt?
> 
> Sounds perfectly reasonable to me.

This works for me too. Thanks!

> 
> Thanks.
>
Hao Luo May 21, 2022, 2:34 a.m. UTC | #17
On Fri, May 20, 2022 at 5:59 PM Yonghong Song <yhs@fb.com> wrote:
> On 5/20/22 3:57 PM, Tejun Heo wrote:
> > Hello,
> >
> > On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote:
> >> We have bpf_map iterator that walks all bpf maps.
> >> When map iterator is parametrized with map_fd the iterator walks
> >> all elements of that map.
> >> cgroup iterator should have similar semantics.
> >> When non-parameterized it will walk all cgroups and their descendent
> >> depth first way. I believe that's what Yonghong is proposing.
> >> When parametrized it will start from that particular cgroup and
> >> walk all descendant of that cgroup only.
> >> The bpf prog can stop the iteration right away with ret 1.
> >> Maybe we can add two parameters. One -> cgroup_fd to use and another ->
> >> the order of iteration css_for_each_descendant_pre vs _post.
> >> wdyt?
> >
> > Sounds perfectly reasonable to me.
>
> This works for me too. Thanks!
>

This sounds good to me. Thanks. Let's try to do it in the next iteration.
Hao Luo May 21, 2022, 2:43 a.m. UTC | #18
On Fri, May 20, 2022 at 5:58 PM Yonghong Song <yhs@fb.com> wrote:
> On 5/20/22 2:49 PM, Hao Luo wrote:
> > Hi Tejun and Yonghong,
> >
> > On Fri, May 20, 2022 at 12:42 PM Hao Luo <haoluo@google.com> wrote:
> >>
> >> Hi Tejun and Yonghong,
> >>
> >> On Fri, May 20, 2022 at 9:45 AM Tejun Heo <tj@kernel.org> wrote:
> >>> On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote:
> >>>>     <various stats interested by the user>
> >>>>
> >>>> This way, user space can easily construct the cgroup hierarchy stat like
> >>>>                             cpu   mem   cpu pressure   mem pressure ...
> >>>>     cgroup1                 ...
> >>>>        child1               ...
> >>>>          grandchild1        ...
> >>>>        child2               ...
> >>>>     cgroup 2                ...
> >>>>        child 3              ...
> >>>>          ...                ...
> >>>>
> >>>> the bpf iterator can have additional parameter like
> >>>> cgroup_id = ... to only call bpf program once with that
> >>>> cgroup_id if specified.
> >>
> >> Yep, this should work. We just need to make the cgroup_id parameter
> >> optional. If it is specified when creating bpf_iter_link, we print for
> >> that cgroup only. If it is not specified, we iterate over all cgroups.
> >> If I understand correctly, sounds doable.
> >>
> >
> > Yonghong, I realized that seek() which Tejun has been calling out, can
> > be used to specify the target cgroup, rather than adding a new
> > parameter. Maybe, we can pass cgroup_id to seek() on cgroup bpf_iter,
> > which will instruct read() to return the corresponding cgroup's stats.
> > On the other hand, reading without calling seek() beforehand will
> > return all the cgroups.
>
> Currently, seek is not supported for bpf_iter.
>
> const struct file_operations bpf_iter_fops = {
>          .open           = iter_open,
>          .llseek         = no_llseek,
>          .read           = bpf_seq_read,
>          .release        = iter_release,
> };
>
> But if seek() works, I don't mind to remove this restriction.
> But not sure what to seek. Do you mean to provide a cgroup_fd/cgroup_id
> as the seek() syscall parameter? This may work.

Yes, passing a cgroup_id as the seek() syscall parameter was what I meant.

Tejun previously requested us to support seek() for a proper iterator.
Since Alexei has a nice solution that all of us have ack'ed, I am not
sure whether we still want to add seek() for bpf_iter as Tejun asked.
I guess not.

>
> But considering we have parameterized example (map_fd) and
> in the future, we may have other parameterized bpf_iter
> (e.g., for one task). Maybe parameter-based approach is better.
>

Acknowledged.

> >
> > WDYT?
Tejun Heo May 21, 2022, 4:53 a.m. UTC | #19
On Fri, May 20, 2022 at 07:43:12PM -0700, Hao Luo wrote:
> Yes, passing a cgroup_id as the seek() syscall parameter was what I meant.
> 
> Tejun previously requested us to support seek() for a proper iterator.
> Since Alexei has a nice solution that all of us have ack'ed, I am not
> sure whether we still want to add seek() for bpf_iter as Tejun asked.
> I guess not.

Yeah, I meant seeking with the ID but it's better to follow the same
convention as other iterators.

Thanks.
Andrii Nakryiko May 23, 2022, 11:58 p.m. UTC | #20
On Fri, May 20, 2022 at 7:35 PM Hao Luo <haoluo@google.com> wrote:
>
> On Fri, May 20, 2022 at 5:59 PM Yonghong Song <yhs@fb.com> wrote:
> > On 5/20/22 3:57 PM, Tejun Heo wrote:
> > > Hello,
> > >
> > > On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote:
> > >> We have bpf_map iterator that walks all bpf maps.
> > >> When map iterator is parametrized with map_fd the iterator walks
> > >> all elements of that map.
> > >> cgroup iterator should have similar semantics.
> > >> When non-parameterized it will walk all cgroups and their descendent
> > >> depth first way. I believe that's what Yonghong is proposing.
> > >> When parametrized it will start from that particular cgroup and
> > >> walk all descendant of that cgroup only.
> > >> The bpf prog can stop the iteration right away with ret 1.
> > >> Maybe we can add two parameters. One -> cgroup_fd to use and another ->
> > >> the order of iteration css_for_each_descendant_pre vs _post.
> > >> wdyt?
> > >
> > > Sounds perfectly reasonable to me.
> >
> > This works for me too. Thanks!
> >
>
> This sounds good to me. Thanks. Let's try to do it in the next iteration.

Can we, in addition to descendant_pre and descendant_post walk
algorithms also add the one that does ascendants walk (i.e., start
from specified cgroup and walk up to the root cgroup)? I don't have
specific example, but it seems natural to include it for "cgroup
iterator" in general. Hopefully it won't add much code to the
implementation.
Hao Luo May 24, 2022, 12:53 a.m. UTC | #21
On Mon, May 23, 2022 at 4:58 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, May 20, 2022 at 7:35 PM Hao Luo <haoluo@google.com> wrote:
> >
> > On Fri, May 20, 2022 at 5:59 PM Yonghong Song <yhs@fb.com> wrote:
> > > On 5/20/22 3:57 PM, Tejun Heo wrote:
> > > > Hello,
> > > >
> > > > On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote:
> > > >> We have bpf_map iterator that walks all bpf maps.
> > > >> When map iterator is parametrized with map_fd the iterator walks
> > > >> all elements of that map.
> > > >> cgroup iterator should have similar semantics.
> > > >> When non-parameterized it will walk all cgroups and their descendent
> > > >> depth first way. I believe that's what Yonghong is proposing.
> > > >> When parametrized it will start from that particular cgroup and
> > > >> walk all descendant of that cgroup only.
> > > >> The bpf prog can stop the iteration right away with ret 1.
> > > >> Maybe we can add two parameters. One -> cgroup_fd to use and another ->
> > > >> the order of iteration css_for_each_descendant_pre vs _post.
> > > >> wdyt?
> > > >
> > > > Sounds perfectly reasonable to me.
> > >
> > > This works for me too. Thanks!
> > >
> >
> > This sounds good to me. Thanks. Let's try to do it in the next iteration.
>
> Can we, in addition to descendant_pre and descendant_post walk
> algorithms also add the one that does ascendants walk (i.e., start
> from specified cgroup and walk up to the root cgroup)? I don't have
> specific example, but it seems natural to include it for "cgroup
> iterator" in general. Hopefully it won't add much code to the
> implementation.

Yep. Sounds reasonable and doable. It's just adding a flag to specify
traversal order, like:

{
  WALK_DESCENDANT_PRE,
  WALK_DESCENDANT_POST,
  WALK_PARENT_UP,
};

In bpf_iter's seq_next(), change the algorithm to yield the parent of
the current cgroup.
Andrii Nakryiko May 24, 2022, 1:30 a.m. UTC | #22
On Mon, May 23, 2022 at 5:53 PM Hao Luo <haoluo@google.com> wrote:
>
> On Mon, May 23, 2022 at 4:58 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, May 20, 2022 at 7:35 PM Hao Luo <haoluo@google.com> wrote:
> > >
> > > On Fri, May 20, 2022 at 5:59 PM Yonghong Song <yhs@fb.com> wrote:
> > > > On 5/20/22 3:57 PM, Tejun Heo wrote:
> > > > > Hello,
> > > > >
> > > > > On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote:
> > > > >> We have bpf_map iterator that walks all bpf maps.
> > > > >> When map iterator is parametrized with map_fd the iterator walks
> > > > >> all elements of that map.
> > > > >> cgroup iterator should have similar semantics.
> > > > >> When non-parameterized it will walk all cgroups and their descendent
> > > > >> depth first way. I believe that's what Yonghong is proposing.
> > > > >> When parametrized it will start from that particular cgroup and
> > > > >> walk all descendant of that cgroup only.
> > > > >> The bpf prog can stop the iteration right away with ret 1.
> > > > >> Maybe we can add two parameters. One -> cgroup_fd to use and another ->
> > > > >> the order of iteration css_for_each_descendant_pre vs _post.
> > > > >> wdyt?
> > > > >
> > > > > Sounds perfectly reasonable to me.
> > > >
> > > > This works for me too. Thanks!
> > > >
> > >
> > > This sounds good to me. Thanks. Let's try to do it in the next iteration.
> >
> > Can we, in addition to descendant_pre and descendant_post walk
> > algorithms also add the one that does ascendants walk (i.e., start
> > from specified cgroup and walk up to the root cgroup)? I don't have
> > specific example, but it seems natural to include it for "cgroup
> > iterator" in general. Hopefully it won't add much code to the
> > implementation.
>
> Yep. Sounds reasonable and doable. It's just adding a flag to specify
> traversal order, like:
>
> {
>   WALK_DESCENDANT_PRE,
>   WALK_DESCENDANT_POST,
>   WALK_PARENT_UP,

Probably something more like BPF_CG_WALK_DESCENDANT_PRE and so on?

> };
>
> In bpf_iter's seq_next(), change the algorithm to yield the parent of
> the current cgroup.
diff mbox series

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c107392b0ba7..74c30fe20c23 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -44,6 +44,7 @@  struct kobject;
 struct mem_cgroup;
 struct module;
 struct bpf_func_state;
+struct cgroup;
 
 extern struct idr btf_idr;
 extern spinlock_t btf_idr_lock;
@@ -1581,6 +1582,7 @@  int bpf_obj_get_user(const char __user *pathname, int flags);
 
 struct bpf_iter_aux_info {
 	struct bpf_map *map;
+	struct cgroup *cgroup;
 };
 
 typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0210f85131b3..e5bc40d4bccc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@  union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		__u64	cgroup_id;
+	} cgroup;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5965,6 +5968,9 @@  struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+				struct {
+					__u64 cgroup_id;
+				} cgroup;
 			};
 		} iter;
 		struct  {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 057ba8e01e70..3e563b163d49 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -36,6 +36,9 @@  obj-$(CONFIG_BPF_SYSCALL) += bpf_struct_ops.o
 obj-${CONFIG_BPF_LSM} += bpf_lsm.o
 endif
 obj-$(CONFIG_BPF_PRELOAD) += preload/
+ifeq ($(CONFIG_CGROUPS),y)
+obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o
+endif
 
 obj-$(CONFIG_BPF_SYSCALL) += relo_core.o
 $(obj)/relo_core.o: $(srctree)/tools/lib/bpf/relo_core.c FORCE
diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c
new file mode 100644
index 000000000000..86bdfe135d24
--- /dev/null
+++ b/kernel/bpf/cgroup_iter.c
@@ -0,0 +1,148 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Google */
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <linux/cgroup.h>
+#include <linux/kernel.h>
+#include <linux/seq_file.h>
+
+struct bpf_iter__cgroup {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct cgroup *, cgroup);
+};
+
+static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	/* Only one session is supported. */
+	if (*pos > 0)
+		return NULL;
+
+	if (*pos == 0)
+		++*pos;
+
+	return *(struct cgroup **)seq->private;
+}
+
+static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return NULL;
+}
+
+static int cgroup_iter_seq_show(struct seq_file *seq, void *v)
+{
+	struct bpf_iter__cgroup ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	ctx.meta = &meta;
+	ctx.cgroup = v;
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, false);
+	if (prog)
+		ret = bpf_iter_run_prog(prog, &ctx);
+
+	return ret;
+}
+
+static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
+{
+}
+
+static const struct seq_operations cgroup_iter_seq_ops = {
+	.start  = cgroup_iter_seq_start,
+	.next   = cgroup_iter_seq_next,
+	.stop   = cgroup_iter_seq_stop,
+	.show   = cgroup_iter_seq_show,
+};
+
+BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup)
+
+static int cgroup_iter_seq_init(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+	*(struct cgroup **)priv_data = aux->cgroup;
+	return 0;
+}
+
+static const struct bpf_iter_seq_info cgroup_iter_seq_info = {
+	.seq_ops                = &cgroup_iter_seq_ops,
+	.init_seq_private       = cgroup_iter_seq_init,
+	.seq_priv_size          = sizeof(struct cgroup *),
+};
+
+static int bpf_iter_attach_cgroup(struct bpf_prog *prog,
+				  union bpf_iter_link_info *linfo,
+				  struct bpf_iter_aux_info *aux)
+{
+	struct cgroup *cgroup;
+
+	cgroup = cgroup_get_from_id(linfo->cgroup.cgroup_id);
+	if (!cgroup)
+		return -EBUSY;
+
+	aux->cgroup = cgroup;
+	return 0;
+}
+
+static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux)
+{
+	if (aux->cgroup)
+		cgroup_put(aux->cgroup);
+}
+
+static void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux,
+					struct seq_file *seq)
+{
+	char *buf;
+
+	seq_printf(seq, "cgroup_id:\t%llu\n", cgroup_id(aux->cgroup));
+
+	buf = kmalloc(PATH_MAX, GFP_KERNEL);
+	if (!buf) {
+		seq_puts(seq, "cgroup_path:\n");
+		return;
+	}
+
+	/* If cgroup_path_ns() fails, buf will be an empty string, cgroup_path
+	 * will print nothing.
+	 *
+	 * Cgroup_path is the path in the calliing process's cgroup namespace.
+	 */
+	cgroup_path_ns(aux->cgroup, buf, sizeof(buf),
+		       current->nsproxy->cgroup_ns);
+	seq_printf(seq, "cgroup_path:\t%s\n", buf);
+	kfree(buf);
+}
+
+static int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux,
+					  struct bpf_link_info *info)
+{
+	info->iter.cgroup.cgroup_id = cgroup_id(aux->cgroup);
+	return 0;
+}
+
+DEFINE_BPF_ITER_FUNC(cgroup, struct bpf_iter_meta *meta,
+		     struct cgroup *cgroup)
+
+static struct bpf_iter_reg bpf_cgroup_reg_info = {
+	.target			= "cgroup",
+	.attach_target		= bpf_iter_attach_cgroup,
+	.detach_target		= bpf_iter_detach_cgroup,
+	.show_fdinfo		= bpf_iter_cgroup_show_fdinfo,
+	.fill_link_info		= bpf_iter_cgroup_fill_link_info,
+	.ctx_arg_info_size	= 1,
+	.ctx_arg_info		= {
+		{ offsetof(struct bpf_iter__cgroup, cgroup),
+		  PTR_TO_BTF_ID },
+	},
+	.seq_info		= &cgroup_iter_seq_info,
+};
+
+static int __init bpf_cgroup_iter_init(void)
+{
+	bpf_cgroup_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[0];
+	return bpf_iter_reg_target(&bpf_cgroup_reg_info);
+}
+
+late_initcall(bpf_cgroup_iter_init);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0210f85131b3..e5bc40d4bccc 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@  union bpf_iter_link_info {
 	struct {
 		__u32	map_fd;
 	} map;
+	struct {
+		__u64	cgroup_id;
+	} cgroup;
 };
 
 /* BPF syscall commands, see bpf(2) man-page for more details. */
@@ -5965,6 +5968,9 @@  struct bpf_link_info {
 				struct {
 					__u32 map_id;
 				} map;
+				struct {
+					__u64 cgroup_id;
+				} cgroup;
 			};
 		} iter;
 		struct  {