
[RFC,bpf-next,v2,5/5] selftests/bpf: test for pinning for cgroup_view link

Message ID 20220201205534.1962784-6-haoluo@google.com (mailing list archive)
State RFC
Delegated to: BPF
Series Extend cgroup interface with bpf

Checks

Context Check Description
bpf/vmtest-bpf-next success VM_Test
bpf/vmtest-bpf-next-PR success PR summary
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 5 maintainers not CCed: linux-kselftest@vger.kernel.org john.fastabend@gmail.com kuniyu@amazon.co.jp shuah@kernel.org netdev@vger.kernel.org
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: externs should be avoided in .c files WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: unchecked sscanf return value
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Hao Luo Feb. 1, 2022, 8:55 p.m. UTC
A selftest for demoing/testing cgroup_view pinning, directory tagging
and cat'ing pinned cgroup_view objects. The added cgroup_view prog is a
naive use case, which merely reads the cgroup's id.

This selftest introduces two metrics for measuring cgroup-level
scheduling queueing delays:

 - queue_self: queueing delays caused by waiting behind tasks of the
   same cgroup.
 - queue_other: queueing delays caused by waiting behind tasks of
   other cgroups.

The selftest collects task-level queueing delays and breaks them down
into queue_self delays and queue_other delays. This is done by hooking
handlers at context switch and wakeup. Only the max queue_self and
queue_other delays are recorded. This breakdown is helpful in analyzing
the cause of long scheduling delays: a large queue_self value indicates
contention from within the cgroup, while a large queue_other value
indicates contention between cgroups. For example, if a task waits 10ms
in total and tasks of its own cgroup run for 6ms of that time, its
queue_self delay is 6ms and its queue_other delay is 4ms.

A new prog "dump_cgroup_lat" of the iter prog type cgroup_view is
implemented; it reads the recorded delays and dumps the stats through
bpffs interfaces. Specifically, the cgroup_view prog is initially pinned
in an empty directory in bpffs, which effectively turns the directory
into a mirror of the cgroup hierarchy. When a new cgroup is created, we
manually create a corresponding directory in bpffs. The new directory
will contain a file prepopulated with the pinned cgroup_view object.
Reading that file yields the queue stats of the corresponding cgroup.
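
For illustration, the intended flow on the user side would look roughly
like the sketch below. Everything is unprivileged; the directory names
are examples only, and "obj" stands for the file prepopulated from the
pinned cgroup_view object:

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void)
  {
  	char buf[128];
  	ssize_t n;
  	int fd;

  	/* create a cgroup, then mirror it in the tagged bpffs directory */
  	mkdir("/sys/fs/cgroup/job", 0755);
  	mkdir("/sys/fs/bpf/views/job", 0755);	/* auto-populated with "obj" */

  	/* reading the inherited object dumps this cgroup's queue stats */
  	fd = open("/sys/fs/bpf/views/job/obj", O_RDONLY);
  	if (fd < 0)
  		return 1;
  	while ((n = read(fd, buf, sizeof(buf))) > 0)
  		fwrite(buf, 1, n, stdout);
  	close(fd);
  	return 0;
  }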

Signed-off-by: Hao Luo <haoluo@google.com>
---
 .../selftests/bpf/prog_tests/pinning_cgroup.c | 143 +++++++++++
 tools/testing/selftests/bpf/progs/bpf_iter.h  |   7 +
 .../bpf/progs/bpf_iter_cgroup_view.c          | 232 ++++++++++++++++++
 3 files changed, 382 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/pinning_cgroup.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_cgroup_view.c

Comments

Alexei Starovoitov Feb. 3, 2022, 6:04 p.m. UTC | #1
On Tue, Feb 01, 2022 at 12:55:34PM -0800, Hao Luo wrote:
> +
> +SEC("iter/cgroup_view")
> +int dump_cgroup_lat(struct bpf_iter__cgroup_view *ctx)
> +{
> +	struct seq_file *seq = ctx->meta->seq;
> +	struct cgroup *cgroup = ctx->cgroup;
> +	struct wait_lat *lat;
> +	u64 id;
> +
> +	BPF_SEQ_PRINTF(seq, "cgroup_id: %8lu\n", cgroup->kn->id);
> +	lat = bpf_map_lookup_elem(&cgroup_lat, &id);

Looks like "id = cgroup->kn->id" assignment is missing here?

Thanks a lot for this test. It explains the motivation well.

It seems that the patches 1-4 are there to automatically
supply cgroup pointer into bpf_iter__cgroup_view.

Since user space needs to track a good part of cgroup dir operations
can we task it with the job of patches 1-4 as well?
It can register notifier for cgroupfs operations and
do mkdir in bpffs similarly _and_ parametrize 'view' bpf program
with corresponding cgroup_id.
Ideally there is no new 'view' program and it's a subset of 'iter'
bpf program. They're already parametrizable.
When 'iter' is pinned the user space can tell it which object it should
iterate on. The 'view' will be an iterator of one element and
argument to it can be cgroup_id.
When user space pins the same 'view' program in a newly created bpffs
directory it will parametrize it with a different cgroup_id.
At the end the same 'view' program will be pinned in multiple directories
with different cgroup_id arguments.
This patch 5 will look very much the same, but patches 1-4 will not be
necessary.
Of course there are races between cgroup create/destroy and bpffs
mkdir, prog pin operations, but they will be there regardless.
The patch 1-4 approach is not race free either.
Will that work?
Hao Luo Feb. 3, 2022, 10:46 p.m. UTC | #2
On Thu, Feb 3, 2022 at 10:04 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Feb 01, 2022 at 12:55:34PM -0800, Hao Luo wrote:
> > +
> > +SEC("iter/cgroup_view")
> > +int dump_cgroup_lat(struct bpf_iter__cgroup_view *ctx)
> > +{
> > +     struct seq_file *seq = ctx->meta->seq;
> > +     struct cgroup *cgroup = ctx->cgroup;
> > +     struct wait_lat *lat;
> > +     u64 id;
> > +
> > +     BPF_SEQ_PRINTF(seq, "cgroup_id: %8lu\n", cgroup->kn->id);
> > +     lat = bpf_map_lookup_elem(&cgroup_lat, &id);
>
> Looks like "id = cgroup->kn->id" assignment is missing here?
>

Ah, yes. I'll fix it.

> Thanks a lot for this test. It explains the motivation well.
>
> It seems that the patches 1-4 are there to automatically
> supply cgroup pointer into bpf_iter__cgroup_view.
>
> Since user space needs to track a good part of cgroup dir operations
> can we task it with the job of patches 1-4 as well?
> It can register notifier for cgroupfs operations and
> do mkdir in bpffs similarly _and_ parametrize 'view' bpf program
> with corresponding cgroup_id.
> Ideally there is no new 'view' program and it's a subset of 'iter'
> bpf program. They're already parametrizable.
> When 'iter' is pinned the user space can tell it which object it should
> iterate on. The 'view' will be an iterator of one element and
> argument to it can be cgroup_id.
> When user space pins the same 'view' program in a newly created bpffs
> directory it will parametrize it with a different cgroup_id.
> At the end the same 'view' program will be pinned in multiple directories
> with different cgroup_id arguments.
> This patch 5 will look very much the same, but patches 1-4 will not be
> necessary.
> Of course there are races between cgroup create/destroy and bpffs
> mkdir, prog pin operations, but they will be there regardless.
> The patch 1-4 approach is not race free either.

Right. I tried to minimize the races between cgroupfs and bpffs in
this patchset. The cgroup and kernfs APIs called in this patchset
guarantee that the cgroup and kernfs objects stay alive once a
reference is taken. At least some state in the objects, such as 'id',
remains valid.

> Will that work?

Thanks Alexei for the idea.

The parameterization part sounds good. By 'parametrize', do you mean a
variable in iter prog (like the 'pid' variable in bpf_iter_task_vma.c
[1])? or some metadata of the program? I assume it's program's
metadata. Either parameterizing with cgroup_id or passing cgroup
object to the prog should work. The problem is at pinning.

In our use case, we can't ask the users who create cgroups to do the
pinning. Pinning requires root privilege. In our use case, we have
non-root users who can create cgroup directories and still want to
read bpf stats. They can't do pinning by themselves. This is why
inheritance is a requirement for us. With inheritance, they only need
to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
operation is required. Patch 1-4 are needed to implement inheritance.

It's also not a good idea in our use case to add a userspace
privileged process to monitor cgroupfs operations and perform the
pinning. It's more complex and has a higher maintenance cost and
runtime overhead, compared to the solution of asking whoever makes
cgroups to mkdir in bpffs. The other problem is: if there are nodes in
the data center that don't have the userspace process deployed, the
stats will be unavailable, which is a no-no for some of our users.

[1] tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c
Alexei Starovoitov Feb. 4, 2022, 3:33 a.m. UTC | #3
On Thu, Feb 3, 2022 at 2:46 PM Hao Luo <haoluo@google.com> wrote:
>
> On Thu, Feb 3, 2022 at 10:04 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Feb 01, 2022 at 12:55:34PM -0800, Hao Luo wrote:
> > > +
> > > +SEC("iter/cgroup_view")
> > > +int dump_cgroup_lat(struct bpf_iter__cgroup_view *ctx)
> > > +{
> > > +     struct seq_file *seq = ctx->meta->seq;
> > > +     struct cgroup *cgroup = ctx->cgroup;
> > > +     struct wait_lat *lat;
> > > +     u64 id;
> > > +
> > > +     BPF_SEQ_PRINTF(seq, "cgroup_id: %8lu\n", cgroup->kn->id);
> > > +     lat = bpf_map_lookup_elem(&cgroup_lat, &id);
> >
> > Looks like "id = cgroup->kn->id" assignment is missing here?
> >
>
> Ah, yes. I'll fix it.
>
> > Thanks a lot for this test. It explains the motivation well.
> >
> > It seems that the patches 1-4 are there to automatically
> > supply cgroup pointer into bpf_iter__cgroup_view.
> >
> > Since user space needs to track a good part of cgroup dir operations
> > can we task it with the job of patches 1-4 as well?
> > It can register notifier for cgroupfs operations and
> > do mkdir in bpffs similarly _and_ parametrize 'view' bpf program
> > with corresponding cgroup_id.
> > Ideally there is no new 'view' program and it's a subset of 'iter'
> > bpf program. They're already parametrizable.
> > When 'iter' is pinned the user space can tell it which object it should
> > iterate on. The 'view' will be an iterator of one element and
> > argument to it can be cgroup_id.
> > When user space pins the same 'view' program in a newly created bpffs
> > directory it will parametrize it with a different cgroup_id.
> > At the end the same 'view' program will be pinned in multiple directories
> > with different cgroup_id arguments.
> > This patch 5 will look very much the same, but patches 1-4 will not be
> > necessary.
> > Of course there are races between cgroup create/destroy and bpffs
> > mkdir, prog pin operations, but they will be there regardless.
> > The patch 1-4 approach is not race free either.
>
> Right. I tried to minimize the races between cgroupfs and bpffs in
> this patchset. The cgroup and kernfs APIs called in this patchset
> guarantee that the cgroup and kernfs objects stay alive once a
> reference is taken. At least some state in the objects, such as 'id',
> remains valid.
>
> > Will that work?
>
> Thanks Alexei for the idea.
>
> The parameterization part sounds good. By 'parametrize', do you mean a
> variable in iter prog (like the 'pid' variable in bpf_iter_task_vma.c
> [1])? or some metadata of the program? I assume it's program's
> metadata. Either parameterizing with cgroup_id or passing cgroup
> object to the prog should work. The problem is at pinning.

The bpf_iter_link_info is used to parametrize the iterator.
The map iterator will iterate the given map_fd.
iirc pinning is not parameterizable yet,
but that's not difficult to add.
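
For reference, parametrizing the existing map iterator from user space
looks roughly like this; the commented-out cgroup_id field is
hypothetical, only to show where such a parameter would go:

  union bpf_iter_link_info linfo = {};
  DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
  struct bpf_link *link;

  linfo.map.map_fd = map_fd;	/* existing: iterate one given map */
  /* linfo.cgroup.cgroup_id = id;  hypothetical cgroup parameter */
  opts.link_info = &linfo;
  opts.link_info_len = sizeof(linfo);
  link = bpf_program__attach_iter(prog, &opts);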


> In our use case, we can't ask the users who create cgroups to do the
> pinning. Pinning requires root privilege. In our use case, we have
> non-root users who can create cgroup directories and still want to
> read bpf stats. They can't do pinning by themselves. This is why
> inheritance is a requirement for us. With inheritance, they only need
> to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> operation is required. Patch 1-4 are needed to implement inheritance.
>
> It's also not a good idea in our use case to add a userspace
> privileged process to monitor cgroupfs operations and perform the
> pinning. It's more complex and has a higher maintenance cost and
> runtime overhead, compared to the solution of asking whoever makes
> cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> the data center that don't have the userspace process deployed, the
> stats will be unavailable, which is a no-no for some of our users.

The commit log says that there will be a daemon that does that
monitoring of cgroupfs. And that daemon needs to mkdir
directories in bpffs when a new cgroup is created, no?
The kernel is only doing inheritance of bpf progs into
new dirs. I think that daemon can pin as well.

The cgroup creation is typically managed by an agent like systemd.
Sounds like you have your own agent that creates cgroups?
If so it has to be privileged and it can mkdir in bpffs and pin too?
Hao Luo Feb. 4, 2022, 6:26 p.m. UTC | #4
On Thu, Feb 3, 2022 at 7:33 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Feb 3, 2022 at 2:46 PM Hao Luo <haoluo@google.com> wrote:
> >
> > On Thu, Feb 3, 2022 at 10:04 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Feb 01, 2022 at 12:55:34PM -0800, Hao Luo wrote:
> > > > +
> > > > +SEC("iter/cgroup_view")
> > > > +int dump_cgroup_lat(struct bpf_iter__cgroup_view *ctx)
> > > > +{
> > > > +     struct seq_file *seq = ctx->meta->seq;
> > > > +     struct cgroup *cgroup = ctx->cgroup;
> > > > +     struct wait_lat *lat;
> > > > +     u64 id;
> > > > +
> > > > +     BPF_SEQ_PRINTF(seq, "cgroup_id: %8lu\n", cgroup->kn->id);
> > > > +     lat = bpf_map_lookup_elem(&cgroup_lat, &id);
> > >
> > > Looks like "id = cgroup->kn->id" assignment is missing here?
> > >
> >
> > Ah, yes. I'll fix it.
> >
> > > Thanks a lot for this test. It explains the motivation well.
> > >
> > > It seems that the patches 1-4 are there to automatically
> > > supply cgroup pointer into bpf_iter__cgroup_view.
> > >
> > > Since user space needs to track a good part of cgroup dir operations
> > > can we task it with the job of patches 1-4 as well?
> > > It can register notifier for cgroupfs operations and
> > > do mkdir in bpffs similarly _and_ parametrize 'view' bpf program
> > > with corresponding cgroup_id.
> > > Ideally there is no new 'view' program and it's a subset of 'iter'
> > > bpf program. They're already parametrizable.
> > > When 'iter' is pinned the user space can tell it which object it should
> > > iterate on. The 'view' will be an iterator of one element and
> > > argument to it can be cgroup_id.
> > > When user space pins the same 'view' program in a newly created bpffs
> > > directory it will parametrize it with a different cgroup_id.
> > > At the end the same 'view' program will be pinned in multiple directories
> > > with different cgroup_id arguments.
> > > This patch 5 will look very much the same, but patches 1-4 will not be
> > > necessary.
> > > Of course there are races between cgroup create/destroy and bpffs
> > > mkdir, prog pin operations, but they will be there regardless.
> > > The patch 1-4 approach is not race free either.
> >
> > Right. I tried to minimize the races between cgroupfs and bpffs in
> > this patchset. The cgroup and kernfs APIs called in this patchset
> > guarantee that the cgroup and kernfs objects stay alive once a
> > reference is taken. At least some state in the objects, such as
> > 'id', remains valid.
> >
> > > Will that work?
> >
> > Thanks Alexei for the idea.
> >
> > The parameterization part sounds good. By 'parametrize', do you mean a
> > variable in iter prog (like the 'pid' variable in bpf_iter_task_vma.c
> > [1])? or some metadata of the program? I assume it's program's
> > metadata. Either parameterizing with cgroup_id or passing cgroup
> > object to the prog should work. The problem is at pinning.
>
> The bpf_iter_link_info is used to parametrize the iterator.
> The map iterator will iterate the given map_fd.
> iirc pinning is not parameterizable yet,
> but that's not difficult to add.
>

I can take a look at that. This will be useful in our use case.

>
> > In our use case, we can't ask the users who create cgroups to do the
> > pinning. Pinning requires root privilege. In our use case, we have
> > non-root users who can create cgroup directories and still want to
> > read bpf stats. They can't do pinning by themselves. This is why
> > inheritance is a requirement for us. With inheritance, they only need
> > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> > operation is required. Patch 1-4 are needed to implement inheritance.
> >
> > It's also not a good idea in our use case to add a userspace
> > privileged process to monitor cgroupfs operations and perform the
> > pinning. It's more complex and has a higher maintenance cost and
> > runtime overhead, compared to the solution of asking whoever makes
> > cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> > the data center that don't have the userspace process deployed, the
> > stats will be unavailable, which is a no-no for some of our users.
>
> The commit log says that there will be a daemon that does that
> monitoring of cgroupfs. And that daemon needs to mkdir
> directories in bpffs when a new cgroup is created, no?
> The kernel is only doing inheritance of bpf progs into
> new dirs. I think that daemon can pin as well.
>
> The cgroup creation is typically managed by an agent like systemd.
> Sounds like you have your own agent that creates cgroups?
> If so it has to be privileged and it can mkdir in bpffs and pin too?

Ah, yes, we have our own daemon to manage cgroups. That daemon creates
the top-level cgroup for each job to run inside. However, the job can
create its own cgroups inside the top-level cgroup, for fine grained
resource control. This doesn't go through the daemon. The job-created
cgroups don't have the pinned objects and this is a no-no for our
users.
Alexei Starovoitov Feb. 6, 2022, 4:29 a.m. UTC | #5
On Fri, Feb 4, 2022 at 10:27 AM Hao Luo <haoluo@google.com> wrote:
> >
> > > In our use case, we can't ask the users who create cgroups to do the
> > > pinning. Pinning requires root privilege. In our use case, we have
> > > non-root users who can create cgroup directories and still want to
> > > read bpf stats. They can't do pinning by themselves. This is why
> > > inheritance is a requirement for us. With inheritance, they only need
> > > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> > > operation is required. Patch 1-4 are needed to implement inheritance.
> > >
> > > It's also not a good idea in our use case to add a userspace
> > > privileged process to monitor cgroupfs operations and perform the
> > > pinning. It's more complex and has a higher maintenance cost and
> > > runtime overhead, compared to the solution of asking whoever makes
> > > cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> > > the data center that don't have the userspace process deployed, the
> > > stats will be unavailable, which is a no-no for some of our users.
> >
> > The commit log says that there will be a daemon that does that
> > monitoring of cgroupfs. And that daemon needs to mkdir
> > directories in bpffs when a new cgroup is created, no?
> > The kernel is only doing inheritance of bpf progs into
> > new dirs. I think that daemon can pin as well.
> >
> > The cgroup creation is typically managed by an agent like systemd.
> > Sounds like you have your own agent that creates cgroups?
> > If so it has to be privileged and it can mkdir in bpffs and pin too?
>
> Ah, yes, we have our own daemon to manage cgroups. That daemon creates
> the top-level cgroup for each job to run inside. However, the job can
> create its own cgroups inside the top-level cgroup, for fine grained
> resource control. This doesn't go through the daemon. The job-created
> cgroups don't have the pinned objects and this is a no-no for our
> users.

We can whitelist certain tracepoints to be sleepable and extend
tp_btf prog type to include everything from prog_type_syscall.
Such prog would attach to cgroup_mkdir and cgroup_release
and would call bpf_sys_bpf() helper to pin progs in new bpffs dirs.
We can allow prog_type_syscall to do mkdir in bpffs as well.
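
A very rough sketch of such a prog is below. Everything in it is
hypothetical: the sleepable tp_btf.s attachment to this tracepoint,
the bpf_mkdir() helper and pinning through bpf_sys_bpf() don't exist
today, and view_link_fd stands for a link fd set up ahead of time:

  SEC("tp_btf.s/cgroup_mkdir")
  int BPF_PROG(on_cgroup_mkdir, struct cgroup *cgrp, const char *path)
  {
  	union bpf_attr attr = {};
  	char dir[64], obj[72];

  	/* name the mirror directory after the cgroup's kernfs id */
  	BPF_SNPRINTF(dir, sizeof(dir), "/sys/fs/bpf/views/%llu", cgrp->kn->id);
  	bpf_mkdir(dir, 0755);			/* hypothetical helper */

  	/* hypothetical: pin the pre-created iter link into the new dir */
  	BPF_SNPRINTF(obj, sizeof(obj), "%s/obj", dir);
  	attr.pathname = (__u64)(long)obj;
  	attr.bpf_fd = view_link_fd;
  	return bpf_sys_bpf(BPF_OBJ_PIN, &attr, sizeof(attr));
  }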

This feature could be useful for similar monitoring/introspection tasks.
We can write a program that would monitor bpf prog load/unload
and would pin an iterator prog that would show debug info about a prog.
Like cat /sys/fs/bpf/progs.debug shows a list of loaded progs.
With this feature we can implement:
ls /sys/fs/bpf/all_progs.debug/
and each loaded prog would have a corresponding file.
The file name would be a program name, for example.
cat /sys/fs/bpf/all_progs.debug/my_prog
would pretty print info about 'my_prog' bpf program.

This way the kernfs/cgroupfs specific logic from patches 1-4
will not be necessary.

wdyt?
Hao Luo Feb. 8, 2022, 8:07 p.m. UTC | #6
On Sat, Feb 5, 2022 at 8:29 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Feb 4, 2022 at 10:27 AM Hao Luo <haoluo@google.com> wrote:
> > >
> > > > In our use case, we can't ask the users who create cgroups to do the
> > > > pinning. Pinning requires root privilege. In our use case, we have
> > > > non-root users who can create cgroup directories and still want to
> > > > read bpf stats. They can't do pinning by themselves. This is why
> > > > inheritance is a requirement for us. With inheritance, they only need
> > > > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> > > > operation is required. Patch 1-4 are needed to implement inheritance.
> > > >
> > > > It's also not a good idea in our use case to add a userspace
> > > > privileged process to monitor cgroupfs operations and perform the
> > > > pinning. It's more complex and has a higher maintenance cost and
> > > > runtime overhead, compared to the solution of asking whoever makes
> > > > cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> > > > the data center that don't have the userspace process deployed, the
> > > > stats will be unavailable, which is a no-no for some of our users.
> > >
> > > The commit log says that there will be a daemon that does that
> > > monitoring of cgroupfs. And that daemon needs to mkdir
> > > directories in bpffs when a new cgroup is created, no?
> > > The kernel is only doing inheritance of bpf progs into
> > > new dirs. I think that daemon can pin as well.
> > >
> > > The cgroup creation is typically managed by an agent like systemd.
> > > Sounds like you have your own agent that creates cgroups?
> > > If so it has to be privileged and it can mkdir in bpffs and pin too?
> >
> > Ah, yes, we have our own daemon to manage cgroups. That daemon creates
> > the top-level cgroup for each job to run inside. However, the job can
> > create its own cgroups inside the top-level cgroup, for fine grained
> > resource control. This doesn't go through the daemon. The job-created
> > cgroups don't have the pinned objects and this is a no-no for our
> > users.
>
> We can whitelist certain tracepoints to be sleepable and extend
> tp_btf prog type to include everything from prog_type_syscall.
> Such prog would attach to cgroup_mkdir and cgroup_release
> and would call bpf_sys_bpf() helper to pin progs in new bpffs dirs.
> We can allow prog_type_syscall to do mkdir in bpffs as well.
>
> This feature could be useful for similar monitoring/introspection tasks.
> We can write a program that would monitor bpf prog load/unload
> and would pin an iterator prog that would show debug info about a prog.
> Like cat /sys/fs/bpf/progs.debug shows a list of loaded progs.
> With this feature we can implement:
> ls /sys/fs/bpf/all_progs.debug/
> and each loaded prog would have a corresponding file.
> The file name would be a program name, for example.
> cat /sys/fs/bpf/all_progs.debug/my_prog
> would pretty print info about 'my_prog' bpf program.
>
> This way the kernfs/cgroupfs specific logic from patches 1-4
> will not be necessary.
>
> wdyt?

Thanks Alexei. I gave it more thought in the last couple of days.
Actually I think it's a good idea, more flexible. It gets rid of the
need for a user space daemon for monitoring cgroup creation and
destruction. We could monitor task creations and exits as well, so
that we can export per-task information (e.g. task_vma_iter) more
efficiently.

A couple of thoughts when thinking about the details:

- Regarding parameterized pinning, I don't think we can have one
single bpf_iter_link object with different parameters, because the
parameters are part of the bpf_iter_link (bpf_iter_aux_info). So every
time we pin, we have to attach iter in order to get a new link object
first. So we need to add attach and detach in bpf_sys_bpf().
- We also need to add those syscalls for cleanup: (1) unlink for
removing pinned obj and (2) rmdir for removing the directory in
prog_type_syscall.

With these extensions, we can shift some of the bpf operations
currently performed in system daemons into the kernel. IMHO it's a
great thing, making system monitoring more flexible.
Alexei Starovoitov Feb. 8, 2022, 9:20 p.m. UTC | #7
On Tue, Feb 8, 2022 at 12:07 PM Hao Luo <haoluo@google.com> wrote:
>
> On Sat, Feb 5, 2022 at 8:29 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Feb 4, 2022 at 10:27 AM Hao Luo <haoluo@google.com> wrote:
> > > >
> > > > > In our use case, we can't ask the users who create cgroups to do the
> > > > > pinning. Pinning requires root privilege. In our use case, we have
> > > > > non-root users who can create cgroup directories and still want to
> > > > > read bpf stats. They can't do pinning by themselves. This is why
> > > > > inheritance is a requirement for us. With inheritance, they only need
> > > > > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> > > > > operation is required. Patch 1-4 are needed to implement inheritance.
> > > > >
> > > > > It's also not a good idea in our use case to add a userspace
> > > > > privileged process to monitor cgroupfs operations and perform the
> > > > > pinning. It's more complex and has a higher maintenance cost and
> > > > > runtime overhead, compared to the solution of asking whoever makes
> > > > > cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> > > > > the data center that don't have the userspace process deployed, the
> > > > > stats will be unavailable, which is a no-no for some of our users.
> > > >
> > > > The commit log says that there will be a daemon that does that
> > > > monitoring of cgroupfs. And that daemon needs to mkdir
> > > > directories in bpffs when a new cgroup is created, no?
> > > > The kernel is only doing inheritance of bpf progs into
> > > > new dirs. I think that daemon can pin as well.
> > > >
> > > > The cgroup creation is typically managed by an agent like systemd.
> > > > Sounds like you have your own agent that creates cgroups?
> > > > If so it has to be privileged and it can mkdir in bpffs and pin too?
> > >
> > > Ah, yes, we have our own daemon to manage cgroups. That daemon creates
> > > the top-level cgroup for each job to run inside. However, the job can
> > > create its own cgroups inside the top-level cgroup, for fine grained
> > > resource control. This doesn't go through the daemon. The job-created
> > > cgroups don't have the pinned objects and this is a no-no for our
> > > users.
> >
> > We can whitelist certain tracepoints to be sleepable and extend
> > tp_btf prog type to include everything from prog_type_syscall.
> > Such prog would attach to cgroup_mkdir and cgroup_release
> > and would call bpf_sys_bpf() helper to pin progs in new bpffs dirs.
> > We can allow prog_type_syscall to do mkdir in bpffs as well.
> >
> > This feature could be useful for similar monitoring/introspection tasks.
> > We can write a program that would monitor bpf prog load/unload
> > and would pin an iterator prog that would show debug info about a prog.
> > Like cat /sys/fs/bpf/progs.debug shows a list of loaded progs.
> > With this feature we can implement:
> > ls /sys/fs/bpf/all_progs.debug/
> > and each loaded prog would have a corresponding file.
> > The file name would be a program name, for example.
> > cat /sys/fs/bpf/all_progs.debug/my_prog
> > would pretty print info about 'my_prog' bpf program.
> >
> > This way the kernfs/cgroupfs specific logic from patches 1-4
> > will not be necessary.
> >
> > wdyt?
>
> Thanks Alexei. I gave it more thought in the last couple of days.
> Actually I think it's a good idea, more flexible. It gets rid of the
> need for a user space daemon for monitoring cgroup creation and
> destruction. We could monitor task creations and exits as well, so
> that we can export per-task information (e.g. task_vma_iter) more
> efficiently.

Yep. Monitoring task creation and exposing via bpf_iter sounds
useful too.

> A couple of thoughts when thinking about the details:
>
> - Regarding parameterized pinning, I don't think we can have one
> single bpf_iter_link object with different parameters, because the
> parameters are part of the bpf_iter_link (bpf_iter_aux_info). So every
> time we pin, we have to attach iter in order to get a new link object
> first. So we need to add attach and detach in bpf_sys_bpf().

Makes sense.
I'm adding bpf_link_create to bpf_sys_bpf as part of
the "lskel for kernel" patch set.
The detach is sys_close. It's already available.
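
From a BPF_PROG_TYPE_SYSCALL prog the building blocks would be roughly
the fragment below (link_create support in bpf_sys_bpf() is assumed
from that pending set; iter_prog_fd is an fd the prog already holds):

  union bpf_attr attr = {};
  int link_fd;

  attr.link_create.prog_fd = iter_prog_fd;
  attr.link_create.attach_type = BPF_TRACE_ITER;
  link_fd = bpf_sys_bpf(BPF_LINK_CREATE, &attr, sizeof(attr));

  /* ... later, detaching is just closing the link fd */
  bpf_sys_close(link_fd);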

> - We also need to add those syscalls for cleanup: (1) unlink for
> removing pinned obj and (2) rmdir for removing the directory in
> prog_type_syscall.

Yes. These two would be needed.
And obj_pin too.

> With these extensions, we can shift some of the bpf operations
> currently performed in system daemons into the kernel. IMHO it's a
> great thing, making system monitoring more flexible.

Awesome. Sounds like we're converging :)
Hao Luo Feb. 8, 2022, 9:34 p.m. UTC | #8
On Tue, Feb 8, 2022 at 1:20 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Feb 8, 2022 at 12:07 PM Hao Luo <haoluo@google.com> wrote:
> >
> > On Sat, Feb 5, 2022 at 8:29 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Feb 4, 2022 at 10:27 AM Hao Luo <haoluo@google.com> wrote:
> > > > >
> > > > > > In our use case, we can't ask the users who create cgroups to do the
> > > > > > pinning. Pinning requires root privilege. In our use case, we have
> > > > > > non-root users who can create cgroup directories and still want to
> > > > > > read bpf stats. They can't do pinning by themselves. This is why
> > > > > > inheritance is a requirement for us. With inheritance, they only need
> > > > > > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> > > > > > operation is required. Patch 1-4 are needed to implement inheritance.
> > > > > >
> > > > > > It's also not a good idea in our use case to add a userspace
> > > > > > privileged process to monitor cgroupfs operations and perform the
> > > > > > pinning. It's more complex and has a higher maintenance cost and
> > > > > > runtime overhead, compared to the solution of asking whoever makes
> > > > > > cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> > > > > > the data center that don't have the userspace process deployed, the
> > > > > > stats will be unavailable, which is a no-no for some of our users.
> > > > >
> > > > > The commit log says that there will be a daemon that does that
> > > > > monitoring of cgroupfs. And that daemon needs to mkdir
> > > > > directories in bpffs when a new cgroup is created, no?
> > > > > The kernel is only doing inheritance of bpf progs into
> > > > > new dirs. I think that daemon can pin as well.
> > > > >
> > > > > The cgroup creation is typically managed by an agent like systemd.
> > > > > Sounds like you have your own agent that creates cgroups?
> > > > > If so it has to be privileged and it can mkdir in bpffs and pin too?
> > > >
> > > > Ah, yes, we have our own daemon to manage cgroups. That daemon creates
> > > > the top-level cgroup for each job to run inside. However, the job can
> > > > create its own cgroups inside the top-level cgroup, for fine grained
> > > > resource control. This doesn't go through the daemon. The job-created
> > > > cgroups don't have the pinned objects and this is a no-no for our
> > > > users.
> > >
> > > We can whitelist certain tracepoints to be sleepable and extend
> > > tp_btf prog type to include everything from prog_type_syscall.
> > > Such prog would attach to cgroup_mkdir and cgroup_release
> > > and would call bpf_sys_bpf() helper to pin progs in new bpffs dirs.
> > > We can allow prog_type_syscall to do mkdir in bpffs as well.
> > >
> > > This feature could be useful for similar monitoring/introspection tasks.
> > > We can write a program that would monitor bpf prog load/unload
> > > and would pin an iterator prog that would show debug info about a prog.
> > > Like cat /sys/fs/bpf/progs.debug shows a list of loaded progs.
> > > With this feature we can implement:
> > > ls /sys/fs/bpf/all_progs.debug/
> > > and each loaded prog would have a corresponding file.
> > > The file name would be a program name, for example.
> > > cat /sys/fs/bpf/all_progs.debug/my_prog
> > > would pretty print info about 'my_prog' bpf program.
> > >
> > > This way the kernfs/cgroupfs specific logic from patches 1-4
> > > will not be necessary.
> > >
> > > wdyt?
> >
> > Thanks Alexei. I gave it more thought in the last couple of days.
> > Actually I think it's a good idea, more flexible. It gets rid of the
> > need for a user space daemon for monitoring cgroup creation and
> > destruction. We could monitor task creations and exits as well, so
> > that we can export per-task information (e.g. task_vma_iter) more
> > efficiently.
>
> Yep. Monitoring task creation and exposing via bpf_iter sounds
> useful too.
>
> > A couple of thoughts when thinking about the details:
> >
> > - Regarding parameterized pinning, I don't think we can have one
> > single bpf_iter_link object with different parameters, because the
> > parameters are part of the bpf_iter_link (bpf_iter_aux_info). So every
> > time we pin, we have to attach iter in order to get a new link object
> > first. So we need to add attach and detach in bpf_sys_bpf().
>
> Makes sense.
> I'm adding bpf_link_create to bpf_sys_bpf as part of
> the "lskel for kernel" patch set.
> The detach is sys_close. It's already available.
>
> > - We also need to add those syscalls for cleanup: (1) unlink for
> > removing pinned obj and (2) rmdir for removing the directory in
> > prog_type_syscall.
>
> Yes. These two would be needed.
> And obj_pin too.
>
> > With these extensions, we can shift some of the bpf operations
> > currently performed in system daemons into the kernel. IMHO it's a
> > great thing, making system monitoring more flexible.
>
> Awesome. Sounds like we're converging :)

Right. :)

Thank you for proposing this solution. This really helps! I'm going to
work on a prototype.
Hao Luo Feb. 14, 2022, 6:29 p.m. UTC | #9
On Tue, Feb 8, 2022 at 1:20 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Feb 8, 2022 at 12:07 PM Hao Luo <haoluo@google.com> wrote:
> >
> > On Sat, Feb 5, 2022 at 8:29 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Feb 4, 2022 at 10:27 AM Hao Luo <haoluo@google.com> wrote:
> > > > >
> > > > > > In our use case, we can't ask the users who create cgroups to do the
> > > > > > pinning. Pinning requires root privilege. In our use case, we have
> > > > > > non-root users who can create cgroup directories and still want to
> > > > > > read bpf stats. They can't do pinning by themselves. This is why
> > > > > > inheritance is a requirement for us. With inheritance, they only need
> > > > > > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> > > > > > operation is required. Patch 1-4 are needed to implement inheritance.
> > > > > >
> > > > > > It's also not a good idea in our use case to add a userspace
> > > > > > privileged process to monitor cgroupfs operations and perform the
> > > > > > pinning. It's more complex and has a higher maintenance cost and
> > > > > > runtime overhead, compared to the solution of asking whoever makes
> > > > > > cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> > > > > > the data center that don't have the userspace process deployed, the
> > > > > > stats will be unavailable, which is a no-no for some of our users.
> > > > >
> > > > > The commit log says that there will be a daemon that does that
> > > > > monitoring of cgroupfs. And that daemon needs to mkdir
> > > > > directories in bpffs when a new cgroup is created, no?
> > > > > The kernel is only doing inheritance of bpf progs into
> > > > > new dirs. I think that daemon can pin as well.
> > > > >
> > > > > The cgroup creation is typically managed by an agent like systemd.
> > > > > Sounds like you have your own agent that creates cgroups?
> > > > > If so it has to be privileged and it can mkdir in bpffs and pin too?
> > > >
> > > > Ah, yes, we have our own daemon to manage cgroups. That daemon creates
> > > > the top-level cgroup for each job to run inside. However, the job can
> > > > create its own cgroups inside the top-level cgroup, for fine grained
> > > > resource control. This doesn't go through the daemon. The job-created
> > > > cgroups don't have the pinned objects and this is a no-no for our
> > > > users.
> > >
> > > We can whitelist certain tracepoints to be sleepable and extend
> > > tp_btf prog type to include everything from prog_type_syscall.
> > > Such prog would attach to cgroup_mkdir and cgroup_release
> > > and would call bpf_sys_bpf() helper to pin progs in new bpffs dirs.
> > > We can allow prog_type_syscall to do mkdir in bpffs as well.
> > >
> > > This feature could be useful for similar monitoring/introspection tasks.
> > > We can write a program that would monitor bpf prog load/unload
> > > and would pin an iterator prog that would show debug info about a prog.
> > > Like cat /sys/fs/bpf/progs.debug shows a list of loaded progs.
> > > With this feature we can implement:
> > > ls /sys/fs/bpf/all_progs.debug/
> > > and each loaded prog would have a corresponding file.
> > > The file name would be a program name, for example.
> > > cat /sys/fs/bpf/all_progs.debug/my_prog
> > > would pretty print info about 'my_prog' bpf program.
> > >
> > > This way the kernfs/cgroupfs specific logic from patches 1-4
> > > will not be necessary.
> > >
> > > wdyt?

Hi Alexei,

Actually, I found this almost worked, except that the tracepoints
cgroup_mkdir and cgroup_rmdir are not sleepable. They are inside a
spinlock's critical section with irq off. I guess one solution is to
offload the sleepable part of the bpf prog into a thread context. We
may create a dedicated kernel thread or use workqueue for this. Do you
have any advice?

> >
> > Thanks Alexei. I gave it more thought in the last couple of days.
> > Actually I think it's a good idea, more flexible. It gets rid of the
> > need for a user space daemon for monitoring cgroup creation and
> > destruction. We could monitor task creations and exits as well, so
> > that we can export per-task information (e.g. task_vma_iter) more
> > efficiently.
>
> Yep. Monitoring task creation and exposing via bpf_iter sounds
> useful too.
>
> > A couple of thoughts when thinking about the details:
> >
> > - Regarding parameterized pinning, I don't think we can have one
> > single bpf_iter_link object with different parameters, because the
> > parameters are part of the bpf_iter_link (bpf_iter_aux_info). So every
> > time we pin, we have to attach iter in order to get a new link object
> > first. So we need to add attach and detach in bpf_sys_bpf().
>
> Makes sense.
> I'm adding bpf_link_create to bpf_sys_bpf as part of
> the "lskel for kernel" patch set.
> The detach is sys_close. It's already available.
>
> > - We also need to add those syscalls for cleanup: (1) unlink for
> > removing pinned obj and (2) rmdir for removing the directory in
> > prog_type_syscall.
>
> Yes. These two would be needed.
> And obj_pin too.
>
> > With these extensions, we can shift some of the bpf operations
> > currently performed in system daemons into the kernel. IMHO it's a
> > great thing, making system monitoring more flexible.
>
> Awesome. Sounds like we're converging :)
Alexei Starovoitov Feb. 14, 2022, 7:24 p.m. UTC | #10
On Mon, Feb 14, 2022 at 10:29 AM Hao Luo <haoluo@google.com> wrote:
> Hi Alexei,
>
> Actually, I found this almost worked, except that the tracepoints
> cgroup_mkdir and cgroup_rmdir are not sleepable. They are inside a
> spinlock's critical section with irq off. I guess one solution is to
> offload the sleepable part of the bpf prog into a thread context. We
> may create a dedicated kernel thread or use workqueue for this. Do you
> have any advice?

Are you referring to spin_lock in TRACE_CGROUP_PATH
that protects global trace_cgroup_path[] buffer?
That is fixable.
Do you actually need the string path returned by cgroup_path() in bpf prog?
Maybe prog can call cgroup_path() by itself when necessary.
Parsing strings isn't great anyway. The bpf prog probably needs the last
part of the dir only. So cgrp->kn->name would do it?
The TRACE_CGROUP_PATH wasn't designed to be turned on 24/7.
That global spin_lock is not great for production use.
No need to delegate sleepable bpf to thread context.
Let's refactor that tracepoint a bit.
Hao Luo Feb. 14, 2022, 8:23 p.m. UTC | #11
On Mon, Feb 14, 2022 at 11:25 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Feb 14, 2022 at 10:29 AM Hao Luo <haoluo@google.com> wrote:
> > Hi Alexei,
> >
> > Actually, I found this almost worked, except that the tracepoints
> > cgroup_mkdir and cgroup_rmdir are not sleepable. They are inside a
> > spinlock's critical section with irq off. I guess one solution is to
> > offload the sleepable part of the bpf prog into a thread context. We
> > may create a dedicated kernel thread or use workqueue for this. Do you
> > have any advice?
>
> Are you referring to spin_lock in TRACE_CGROUP_PATH
> that protects global trace_cgroup_path[] buffer?

Yes, that's the spin_lock I am talking about.

> That is fixable.
> Do you actually need the string path returned by cgroup_path() in bpf prog?
> Maybe prog can call cgroup_path() by itself when necessary.
> Parsing strings isn't great anyway. The bpf prog probably needs the last
> part of the dir only. So cgrp->kn->name would do it?
> The TRACE_CGROUP_PATH wasn't designed to be turned on 24/7.
> That global spin_lock is not great for production use.
> No need to delegate sleepable bpf to thread context.
> Let's refactor that tracepoint a bit.

No, we don't need cgroup_path(). We are going to name the directories
by cgrp->kn->id. I can add a fast version for cgroup_xxx tracepoints,
which don't require the full path and can be turned on 24/7.
Alexei Starovoitov Feb. 14, 2022, 8:27 p.m. UTC | #12
On Mon, Feb 14, 2022 at 12:23 PM Hao Luo <haoluo@google.com> wrote:
>
> On Mon, Feb 14, 2022 at 11:25 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Feb 14, 2022 at 10:29 AM Hao Luo <haoluo@google.com> wrote:
> > > Hi Alexei,
> > >
> > > Actually, I found this almost worked, except that the tracepoints
> > > cgroup_mkdir and cgroup_rmdir are not sleepable. They are inside a
> > > spinlock's critical section with irq off. I guess one solution is to
> > > offload the sleepable part of the bpf prog into a thread context. We
> > > may create a dedicated kernel thread or use workqueue for this. Do you
> > > have any advice?
> >
> > Are you referring to spin_lock in TRACE_CGROUP_PATH
> > that protects global trace_cgroup_path[] buffer?
>
> Yes, that's the spin_lock I am talking about.
>
> > That is fixable.
> > Do you actually need the string path returned by cgroup_path() in bpf prog?
> > Maybe prog can call cgroup_path() by itself when necessary.
> > Parsing strings isn't great anyway. The bpf prog probably needs the last
> > part of the dir only. So cgrp->kn->name would do it?
> > The TRACE_CGROUP_PATH wasn't designed to be turned on 24/7.
> > That global spin_lock is not great for production use.
> > No need to delegate sleepable bpf to thread context.
> > Let's refactor that tracepoint a bit.
>
> No, we don't need cgroup_path(). We are going to name the directories
> by cgrp->kn->id. I can add a fast version for cgroup_xxx tracepoints,
> which don't require the full path and can be turned on 24/7.

Sounds good. We need a flag for tracepoints anyway to indicate
which ones are sleepable.
Probably similar to what we did for DEFINE_EVENT_WRITABLE.
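
Something along these lines, say (the name is hypothetical, mirroring
how DEFINE_EVENT_WRITABLE is handled in include/trace/bpf_probe.h):

  /* include/trace/events/cgroup.h */
  DEFINE_EVENT_SLEEPABLE(cgroup, cgroup_mkdir,
  	TP_PROTO(struct cgroup *cgrp, const char *path),
  	TP_ARGS(cgrp, path));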
Hao Luo Feb. 14, 2022, 8:39 p.m. UTC | #13
On Mon, Feb 14, 2022 at 12:28 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Feb 14, 2022 at 12:23 PM Hao Luo <haoluo@google.com> wrote:
> >
> > On Mon, Feb 14, 2022 at 11:25 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Feb 14, 2022 at 10:29 AM Hao Luo <haoluo@google.com> wrote:
> > > > Hi Alexei,
> > > >
> > > > Actually, I found this almost worked, except that the tracepoints
> > > > cgroup_mkdir and cgroup_rmdir are not sleepable. They are inside a
> > > > spinlock's critical section with irq off. I guess one solution is to
> > > > offload the sleepable part of the bpf prog into a thread context. We
> > > > may create a dedicated kernel thread or use workqueue for this. Do you
> > > > have any advice?
> > >
> > > Are you referring to spin_lock in TRACE_CGROUP_PATH
> > > that protects global trace_cgroup_path[] buffer?
> >
> > Yes, that's the spin_lock I am talking about.
> >
> > > That is fixable.
> > > Do you actually need the string path returned by cgroup_path() in bpf prog?
> > > Maybe prog can call cgroup_path() by itself when necessary.
> > > Parsing strings isn't great anyway. The bpf prog probably needs the last
> > > part of the dir only. So cgrp->kn->name would do it?
> > > The TRACE_CGROUP_PATH wasn't designed to be turned on 24/7.
> > > That global spin_lock is not great for production use.
> > > No need to delegate sleepable bpf to thread context.
> > > Let's refactor that tracepoint a bit.
> >
> > No, we don't need cgroup_path(). We are going to name the directories
> > by cgrp->kn->id. I can add a fast version for cgroup_xxx tracepoints,
> > which don't require the full path and can be turned on 24/7.
>
> Sounds good. We need a flag for tracepoints anyway to indicate
> which ones are sleepable.
> Probably similar to what we did for DEFINE_EVENT_WRITABLE.

No problem. I'll take a look at it. Thanks!

Patch

diff --git a/tools/testing/selftests/bpf/prog_tests/pinning_cgroup.c b/tools/testing/selftests/bpf/prog_tests/pinning_cgroup.c
new file mode 100644
index 000000000000..ebef154e63c9
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/pinning_cgroup.c
@@ -0,0 +1,143 @@ 
+// SPDX-License-Identifier: GPL-2.0
+
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <test_progs.h>
+#include <time.h>
+#include <unistd.h>
+#include "bpf_iter_cgroup_view.skel.h"
+
+static void spin_on_cpu(int seconds)
+{
+	time_t start, now;
+
+	start = time(NULL);
+	do {
+		now = time(NULL);
+	} while (now - start < seconds);
+}
+
+static void do_work(const char *cgroup)
+{
+	int i, cpu = 0, pid;
+	char cmd[128];
+
+	/* make cgroup threaded */
+	snprintf(cmd, 128, "echo threaded > %s/cgroup.type", cgroup);
+	system(cmd);
+
+	/* try to enable the cpu controller. this may fail if the cpu
+	 * controller is not available in cgroup.controllers or a cgroup v1
+	 * hierarchy is already mounted in the system.
+	 */
+	snprintf(cmd, 128, "echo \"+cpu\" > %s/cgroup.subtree_control", cgroup);
+	system(cmd);
+
+	/* launch two children, both running in child cgroup */
+	for (i = 0; i < 2; ++i) {
+		pid = fork();
+		if (pid == 0) {
+			/* attach to cgroup */
+			snprintf(cmd, 128, "echo %d > %s/cgroup.procs", getpid(), cgroup);
+			system(cmd);
+
+			/* pin process to target cpu */
+			snprintf(cmd, 128, "taskset -pc %d %d", cpu, getpid());
+			system(cmd);
+
+			spin_on_cpu(3); /* spin on cpu for 3 seconds */
+			exit(0);
+		}
+	}
+
+	/* pin process to target cpu */
+	snprintf(cmd, 128, "taskset -pc %d %d", cpu, getpid());
+	system(cmd);
+
+	spin_on_cpu(3); /* spin on cpu for 3 seconds */
+	wait(NULL);
+}
+
+static void check_pinning(const char *rootpath)
+{
+	const char *child_cgroup = "/sys/fs/cgroup/child";
+	struct bpf_iter_cgroup_view *skel;
+	struct bpf_link *link;
+	struct stat statbuf = {};
+	FILE *file;
+	unsigned long queue_self, queue_other;
+	int cgroup_id, link_fd;
+	char path[64];
+	char buf[64];
+
+	skel = bpf_iter_cgroup_view__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "bpf_iter_cgroup_view__open_and_load"))
+		return;
+
+	/* pin path at parent dir. */
+	link = bpf_program__attach_iter(skel->progs.dump_cgroup_lat, NULL);
+	link_fd = bpf_link__fd(link);
+
+	/* test initial pinning */
+	snprintf(path, 64, "%s/obj", rootpath);
+	ASSERT_OK(bpf_obj_pin(link_fd, path), "bpf_obj_pin");
+	ASSERT_OK(stat(path, &statbuf), "pinned_object_exists");
+
+	/* test mkdir */
+	mkdir(child_cgroup, 0755);
+	snprintf(path, 64, "%s/child", rootpath);
+	ASSERT_OK(mkdir(path, 0755), "mkdir");
+
+	/* test that new dir has been pre-populated with pinned objects */
+	snprintf(path, 64, "%s/child/obj", rootpath);
+	ASSERT_OK(stat(path, &statbuf), "populate");
+
+	bpf_iter_cgroup_view__attach(skel);
+	do_work(child_cgroup);
+	bpf_iter_cgroup_view__detach(skel);
+
+	/* test cat inherited objects */
+	file = fopen(path, "r");
+	if (ASSERT_OK_PTR(file, "open")) {
+		ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat");
+		ASSERT_EQ(sscanf(buf, "cgroup_id: %8d", &cgroup_id), 1, "output");
+
+		ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat");
+		ASSERT_EQ(sscanf(buf, "queue_self: %8lu", &queue_self), 1, "output");
+
+		ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat");
+		ASSERT_EQ(sscanf(buf, "queue_other: %8lu", &queue_other), 1, "output");
+
+		fclose(file);
+	}
+
+	/* test rmdir */
+	snprintf(path, 64, "%s/child", rootpath);
+	ASSERT_OK(rmdir(path), "rmdir");
+
+	/* unpin object */
+	snprintf(path, 64, "%s/obj", rootpath);
+	ASSERT_OK(unlink(path), "unlink");
+
+	bpf_link__destroy(link);
+	bpf_iter_cgroup_view__destroy(skel);
+}
+
+void test_pinning_cgroup(void)
+{
+	char tmpl[] = "/sys/fs/bpf/pinning_test_XXXXXX";
+	char *rootpath;
+
+	system("mount -t cgroup2 none /sys/fs/cgroup");
+	system("mount -t bpf bpffs /sys/fs/bpf");
+
+	rootpath = mkdtemp(tmpl);
+	chmod(rootpath, 0755);
+
+	/* check pinning map, prog and link in kernfs */
+	if (test__start_subtest("pinning"))
+		check_pinning(rootpath);
+
+	rmdir(rootpath);
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter.h b/tools/testing/selftests/bpf/progs/bpf_iter.h
index 8cfaeba1ddbf..506bb3efd9b4 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter.h
+++ b/tools/testing/selftests/bpf/progs/bpf_iter.h
@@ -16,6 +16,7 @@ 
 #define bpf_iter__bpf_map_elem bpf_iter__bpf_map_elem___not_used
 #define bpf_iter__bpf_sk_storage_map bpf_iter__bpf_sk_storage_map___not_used
 #define bpf_iter__sockmap bpf_iter__sockmap___not_used
+#define bpf_iter__cgroup_view bpf_iter__cgroup_view___not_used
 #define btf_ptr btf_ptr___not_used
 #define BTF_F_COMPACT BTF_F_COMPACT___not_used
 #define BTF_F_NONAME BTF_F_NONAME___not_used
@@ -37,6 +38,7 @@ 
 #undef bpf_iter__bpf_map_elem
 #undef bpf_iter__bpf_sk_storage_map
 #undef bpf_iter__sockmap
+#undef bpf_iter__cgroup_view
 #undef btf_ptr
 #undef BTF_F_COMPACT
 #undef BTF_F_NONAME
@@ -132,6 +134,11 @@  struct bpf_iter__sockmap {
 	struct sock *sk;
 };
 
+struct bpf_iter__cgroup_view {
+	struct bpf_iter_meta *meta;
+	struct cgroup *cgroup;
+} __attribute__((preserve_access_index));
+
 struct btf_ptr {
 	void *ptr;
 	__u32 type_id;
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_cgroup_view.c b/tools/testing/selftests/bpf/progs/bpf_iter_cgroup_view.c
new file mode 100644
index 000000000000..43404c21aee3
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_cgroup_view.c
@@ -0,0 +1,232 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Google */
+#include "bpf_iter.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+char _license[] SEC("license") = "GPL";
+
+#define TASK_RUNNING 0
+#define BPF_F_CURRENT_CPU 0xffffffffULL
+
+extern void fair_sched_class __ksym;
+extern bool CONFIG_FAIR_GROUP_SCHED __kconfig;
+extern bool CONFIG_CGROUP_SCHED __kconfig;
+
+struct wait_lat {
+	/* Queue_self stands for the latency a task experiences while waiting
+	 * behind the tasks that are from the same cgroup.
+	 *
+	 * Queue_other stands for the latency a task experiences while waiting
+	 * behind the tasks that are from other cgroups.
+	 *
+	 * For example, if there are three tasks: A, B and C. Suppose A and B
+	 * are in the same cgroup and C is in another cgroup and we see A has
+	 * a queueing latency X milliseconds. Let's say during the X milliseconds,
+	 * B has run for Y milliseconds. We can break down X to two parts: time
+	 * when B is on cpu, that is Y; the time when C is on cpu, that is X - Y.
+	 *
+	 * Queue_self is the former (Y) while queue_other is the latter (X - Y).
+	 *
+	 * A large value in queue_self is an indication of contention within a
+	 * cgroup, while a large value in queue_other is an indication of
+	 * contention from multiple cgroups.
+	 */
+	u64 queue_self;
+	u64 queue_other;
+};
+
+struct timestamp {
+	/* timestamp when last queued */
+	u64 tsp;
+
+	/* cgroup exec_clock when last queued */
+	u64 exec_clock;
+};
+
+/* Map to store per-cgroup wait latency */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, u64);
+	__type(value, struct wait_lat);
+	__uint(max_entries, 65532);
+} cgroup_lat SEC(".maps");
+
+/* Map to store per-task queue timestamp */
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct timestamp);
+} start SEC(".maps");
+
+/* adapted from task_cfs_rq in kernel/sched/sched.h */
+__always_inline
+struct cfs_rq *task_cfs_rq(struct task_struct *t)
+{
+	if (!CONFIG_FAIR_GROUP_SCHED)
+		return NULL;
+
+	return BPF_CORE_READ(&t->se, cfs_rq);
+}
+
+/* record enqueue timestamp */
+__always_inline
+static int trace_enqueue(struct task_struct *t)
+{
+	u32 pid = t->pid;
+	struct timestamp *ptr;
+	struct cfs_rq *cfs_rq;
+
+	if (!pid)
+		return 0;
+
+	/* only measure for CFS tasks */
+	if (t->sched_class != &fair_sched_class)
+		return 0;
+
+	ptr = bpf_task_storage_get(&start, t, 0,
+				   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!ptr)
+		return 0;
+
+	/* CONFIG_FAIR_GROUP_SCHED may not be enabled */
+	cfs_rq = task_cfs_rq(t);
+	if (!cfs_rq)
+		return 0;
+
+	ptr->tsp = bpf_ktime_get_ns();
+	ptr->exec_clock = BPF_CORE_READ(cfs_rq, exec_clock);
+	return 0;
+}
+
+SEC("tp_btf/sched_wakeup")
+int handle__sched_wakeup(u64 *ctx)
+{
+	/* TP_PROTO(struct task_struct *p) */
+	struct task_struct *p = (void *)ctx[0];
+
+	return trace_enqueue(p);
+}
+
+SEC("tp_btf/sched_wakeup_new")
+int handle__sched_wakeup_new(u64 *ctx)
+{
+	/* TP_PROTO(struct task_struct *p) */
+	struct task_struct *p = (void *)ctx[0];
+
+	return trace_enqueue(p);
+}
+
+/* task_group() from kernel/sched/sched.h */
+__always_inline
+struct task_group *task_group(struct task_struct *p)
+{
+	if (!CONFIG_CGROUP_SCHED)
+		return NULL;
+
+	return BPF_CORE_READ(p, sched_task_group);
+}
+
+__always_inline
+struct cgroup *task_cgroup(struct task_struct *p)
+{
+	struct task_group *tg;
+
+	tg = task_group(p);
+	if (!tg)
+		return NULL;
+
+	return BPF_CORE_READ(tg, css).cgroup;
+}
+
+__always_inline
+u64 max(u64 x, u64 y)
+{
+	return x > y ? x : y;
+}
+
+SEC("tp_btf/sched_switch")
+int handle__sched_switch(u64 *ctx)
+{
+	/* TP_PROTO(bool preempt, struct task_struct *prev,
+	 *	    struct task_struct *next)
+	 */
+	struct task_struct *prev = (struct task_struct *)ctx[1];
+	struct task_struct *next = (struct task_struct *)ctx[2];
+	u64 delta, delta_self, delta_other, id;
+	struct cfs_rq *cfs_rq;
+	struct timestamp *tsp;
+	struct wait_lat *lat;
+	struct cgroup *cgroup;
+
+	/* ivcsw: treat like an enqueue event and store timestamp */
+	if (prev->__state == TASK_RUNNING)
+		trace_enqueue(prev);
+
+	/* only measure for CFS tasks */
+	if (next->sched_class != &fair_sched_class)
+		return 0;
+
+	/* fetch timestamp and calculate delta */
+	tsp = bpf_task_storage_get(&start, next, 0, 0);
+	if (!tsp)
+		return 0;   /* missed enqueue */
+
+	/* CONFIG_FAIR_GROUP_SCHED may not be enabled */
+	cfs_rq = task_cfs_rq(next);
+	if (!cfs_rq)
+		return 0;
+
+	/* cpu controller may not be enabled */
+	cgroup = task_cgroup(next);
+	if (!cgroup)
+		return 0;
+
+	/* calculate self delay and other delay */
+	delta = bpf_ktime_get_ns() - tsp->tsp;
+	delta_self = BPF_CORE_READ(cfs_rq, exec_clock) - tsp->exec_clock;
+	if (delta_self > delta)
+		delta_self = delta;
+	delta_other = delta - delta_self;
+
+	/* insert into cgroup_lat map */
+	id = BPF_CORE_READ(cgroup, kn, id);
+	lat = bpf_map_lookup_elem(&cgroup_lat, &id);
+	if (!lat) {
+		struct wait_lat w = {
+			.queue_self = delta_self,
+			.queue_other = delta_other,
+		};
+
+		bpf_map_update_elem(&cgroup_lat, &id, &w, BPF_ANY);
+	} else {
+		lat->queue_self = max(delta_self, lat->queue_self);
+		lat->queue_other = max(delta_other, lat->queue_other);
+	}
+
+	bpf_task_storage_delete(&start, next);
+	return 0;
+}
+
+SEC("iter/cgroup_view")
+int dump_cgroup_lat(struct bpf_iter__cgroup_view *ctx)
+{
+	struct seq_file *seq = ctx->meta->seq;
+	struct cgroup *cgroup = ctx->cgroup;
+	struct wait_lat *lat;
+	u64 id;
+
+	BPF_SEQ_PRINTF(seq, "cgroup_id: %8lu\n", cgroup->kn->id);
+	lat = bpf_map_lookup_elem(&cgroup_lat, &id);
+	if (lat) {
+		BPF_SEQ_PRINTF(seq, "queue_self: %8lu\n", lat->queue_self);
+		BPF_SEQ_PRINTF(seq, "queue_other: %8lu\n", lat->queue_other);
+	} else {
+		/* print anyway for universal parsing logic in userspace. */
+		BPF_SEQ_PRINTF(seq, "queue_self: %8d\n", 0);
+		BPF_SEQ_PRINTF(seq, "queue_other: %8d\n", 0);
+	}
+	return 0;
+}