| Message ID | 20220520012133.1217211-4-yosryahmed@google.com (mailing list archive) |
|---|---|
| State | Changes Requested |
| Delegated to: | BPF |
| Series | bpf: rstat: cgroup hierarchical stats |
On Fri, May 20, 2022 at 01:21:31AM +0000, Yosry Ahmed wrote: > From: Hao Luo <haoluo@google.com> > > Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this > iter doesn't iterate a set of kernel objects. Instead, it is supposed to > be parameterized by a cgroup id and prints only that cgroup. So one > needs to specify a target cgroup id when attaching this iter. The target > cgroup's state can be read out via a link of this iter. > > Signed-off-by: Hao Luo <haoluo@google.com> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> This could be me not understanding why it's structured this way but it keeps bothering me that this is adding a cgroup iterator which doesn't iterate cgroups. If all that's needed is extracting information from a specific cgroup, why does this need to be an iterator? e.g. why can't I use BPF_PROG_TEST_RUN which looks up the cgroup with the provided ID, flushes rstat, retrieves whatever information necessary and returns that as the result? Thanks.
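For concreteness, here is one possible shape of the BPF_PROG_TEST_RUN flow Tejun is describing, sketched from userspace with libbpf and assuming a BPF_PROG_TYPE_SYSCALL program that looks up the cgroup by id and fills in the context buffer. The struct layout, field names, and prog_fd plumbing are illustrative only, not something proposed in this thread:

```c
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* Hypothetical in/out context shared with a BPF_PROG_TYPE_SYSCALL program
 * that would look up the cgroup, flush rstat, and write the stats back. */
struct dump_ctx {
	__u64 cgroup_id;	/* in: which cgroup to query */
	__u64 nr_vmscan;	/* out: whatever stat the prog collects */
};

static int query_cgroup_stats(int prog_fd, __u64 cgrp_id, __u64 *out)
{
	struct dump_ctx ctx = { .cgroup_id = cgrp_id };
	DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
		.ctx_in = &ctx,
		.ctx_size_in = sizeof(ctx),
		.ctx_out = &ctx,
		.ctx_size_out = sizeof(ctx),
	);
	int err;

	err = bpf_prog_test_run_opts(prog_fd, &opts);
	if (err)
		return err;

	*out = ctx.nr_vmscan;
	return 0;
}
```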
On Fri, May 20, 2022 at 12:41 AM Tejun Heo <tj@kernel.org> wrote: > > On Fri, May 20, 2022 at 01:21:31AM +0000, Yosry Ahmed wrote: > > From: Hao Luo <haoluo@google.com> > > > > Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this > > iter doesn't iterate a set of kernel objects. Instead, it is supposed to > > be parameterized by a cgroup id and prints only that cgroup. So one > > needs to specify a target cgroup id when attaching this iter. The target > > cgroup's state can be read out via a link of this iter. > > > > Signed-off-by: Hao Luo <haoluo@google.com> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > This could be me not understanding why it's structured this way but it keeps > bothering me that this is adding a cgroup iterator which doesn't iterate > cgroups. If all that's needed is extracting information from a specific > cgroup, why does this need to be an iterator? e.g. why can't I use > BPF_PROG_TEST_RUN which looks up the cgroup with the provided ID, flushes > rstat, retrieves whatever information necessary and returns that as the > result? I will let Hao and Yonghong reply here as they have a lot more context, and they had previous discussions about cgroup_iter. I just want to say that exposing the stats in a file is extremely convenient for userspace apps. It becomes very similar to reading stats from cgroupfs. It also makes migrating cgroup stats that we have implemented in the kernel to BPF a lot easier. AFAIK there are also discussions about using overlayfs to have links to the bpffs files in cgroupfs, which makes it even better. So I would really prefer keeping the approach we have here of reading stats through a file from userspace. As for how we go about this (and why a cgroup iterator doesn't iterate cgroups) I will leave this for Hao and Yonghong to explain the rationale behind it. Ideally we can keep the same functionality under a more descriptive name/type. > > Thanks. > > -- > tejun
Hello, On Fri, May 20, 2022 at 12:58:52AM -0700, Yosry Ahmed wrote: > On Fri, May 20, 2022 at 12:41 AM Tejun Heo <tj@kernel.org> wrote: > > > > On Fri, May 20, 2022 at 01:21:31AM +0000, Yosry Ahmed wrote: > > > From: Hao Luo <haoluo@google.com> > > > > > > Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this > > > iter doesn't iterate a set of kernel objects. Instead, it is supposed to > > > be parameterized by a cgroup id and prints only that cgroup. So one > > > needs to specify a target cgroup id when attaching this iter. The target > > > cgroup's state can be read out via a link of this iter. > > > > > > Signed-off-by: Hao Luo <haoluo@google.com> > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > > This could be me not understanding why it's structured this way but it keeps > > bothering me that this is adding a cgroup iterator which doesn't iterate > > cgroups. If all that's needed is extracting information from a specific > > cgroup, why does this need to be an iterator? e.g. why can't I use > > BPF_PROG_TEST_RUN which looks up the cgroup with the provided ID, flushes > > rstat, retrieves whatever information necessary and returns that as the > > result? > > I will let Hao and Yonghong reply here as they have a lot more > context, and they had previous discussions about cgroup_iter. I just > want to say that exposing the stats in a file is extremely convenient > for userspace apps. It becomes very similar to reading stats from > cgroupfs. It also makes migrating cgroup stats that we have > implemented in the kernel to BPF a lot easier. So, if it were upto me, I'd rather direct energy towards making retrieving information through TEST_RUN_PROG easier rather than clinging to making kernel output text. I get that text interface is familiar but it kinda sucks in many ways. > AFAIK there are also discussions about using overlayfs to have links > to the bpffs files in cgroupfs, which makes it even better. So I would > really prefer keeping the approach we have here of reading stats > through a file from userspace. As for how we go about this (and why a > cgroup iterator doesn't iterate cgroups) I will leave this for Hao and > Yonghong to explain the rationale behind it. Ideally we can keep the > same functionality under a more descriptive name/type. My answer would be the same here. You guys seem dead set on making the kernel emulate cgroup1. I'm not gonna explicitly block that but would strongly suggest having a longer term view. If you *must* do the iterator, can you at least make it a proper iterator which supports seeking? AFAICS there's nothing fundamentally preventing bpf iterators from supporting seeking. Or is it that you need something which is pinned to a cgroup so that you can emulate the directory structure? Thanks.
On Thu, May 19, 2022 at 10:11:26PM -1000, Tejun Heo wrote: > If you *must* do the iterator, can you at least make it a proper iterator > which supports seeking? AFAICS there's nothing fundamentally preventing bpf > iterators from supporting seeking. Or is it that you need something which is > pinned to a cgroup so that you can emulate the directory structure? Or, alternatively, would it be possible to make a TEST_RUN_PROG to output a text file in bpffs? There just doesn't seem to be anything cgroup specific that the iterator is doing that can't be done with exposing a couple kfuncs. Thanks.
On 5/20/22 1:11 AM, Tejun Heo wrote: > Hello, > > On Fri, May 20, 2022 at 12:58:52AM -0700, Yosry Ahmed wrote: >> On Fri, May 20, 2022 at 12:41 AM Tejun Heo <tj@kernel.org> wrote: >>> >>> On Fri, May 20, 2022 at 01:21:31AM +0000, Yosry Ahmed wrote: >>>> From: Hao Luo <haoluo@google.com> >>>> >>>> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this >>>> iter doesn't iterate a set of kernel objects. Instead, it is supposed to >>>> be parameterized by a cgroup id and prints only that cgroup. So one >>>> needs to specify a target cgroup id when attaching this iter. The target >>>> cgroup's state can be read out via a link of this iter. >>>> >>>> Signed-off-by: Hao Luo <haoluo@google.com> >>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> >>> >>> This could be me not understanding why it's structured this way but it keeps >>> bothering me that this is adding a cgroup iterator which doesn't iterate >>> cgroups. If all that's needed is extracting information from a specific >>> cgroup, why does this need to be an iterator? e.g. why can't I use >>> BPF_PROG_TEST_RUN which looks up the cgroup with the provided ID, flushes >>> rstat, retrieves whatever information necessary and returns that as the >>> result? >> >> I will let Hao and Yonghong reply here as they have a lot more >> context, and they had previous discussions about cgroup_iter. I just >> want to say that exposing the stats in a file is extremely convenient >> for userspace apps. It becomes very similar to reading stats from >> cgroupfs. It also makes migrating cgroup stats that we have >> implemented in the kernel to BPF a lot easier. > > So, if it were upto me, I'd rather direct energy towards making retrieving > information through TEST_RUN_PROG easier rather than clinging to making > kernel output text. I get that text interface is familiar but it kinda > sucks in many ways. > >> AFAIK there are also discussions about using overlayfs to have links >> to the bpffs files in cgroupfs, which makes it even better. So I would >> really prefer keeping the approach we have here of reading stats >> through a file from userspace. As for how we go about this (and why a >> cgroup iterator doesn't iterate cgroups) I will leave this for Hao and >> Yonghong to explain the rationale behind it. Ideally we can keep the >> same functionality under a more descriptive name/type. > > My answer would be the same here. You guys seem dead set on making the > kernel emulate cgroup1. I'm not gonna explicitly block that but would > strongly suggest having a longer term view. > > If you *must* do the iterator, can you at least make it a proper iterator > which supports seeking? AFAICS there's nothing fundamentally preventing bpf > iterators from supporting seeking. Or is it that you need something which is > pinned to a cgroup so that you can emulate the directory structure? The current bpf_iter for cgroup is for the google use case per previous discussion. But I think a generic cgroup bpf iterator should help as well. Maybe you can have a bpf program signature like below: int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp, struct cgroup *parent_cgrp) parent_cgrp is NULL when cgrp is the root cgroup. I would like the bpf program should send the following information to user space: <parent cgroup dir name> <current cgroup dir name> <various stats interested by the user> This way, user space can easily construct the cgroup hierarchy stat like cpu mem cpu pressure mem pressure ... cgroup1 ... child1 ... grandchild1 ... 
child2 ... cgroup 2 ... child 3 ... ... ... the bpf iterator can have additional parameter like cgroup_id = ... to only call bpf program once with that cgroup_id if specified. The kernel part of cgroup_iter can call cgroup_rstat_flush() before calling cgroup_iter bpf program. WDYT? > > Thanks. >
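To make the proposed program shape concrete, a minimal sketch written against the bpf_iter__cgroup context defined in the patch below; the direct kn->id/kn->name reads and the printed fields are placeholders for whatever helpers or kfuncs end up being used for the real stats:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

SEC("iter/cgroup")
int dump_vmscan(struct bpf_iter__cgroup *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct cgroup *cgrp = ctx->cgroup;

	if (!cgrp)
		return 0;

	/* One line per visited cgroup: id, directory name, then whatever
	 * stats the program collects (elided). Userspace can reassemble
	 * the hierarchy table from the ids. */
	BPF_SEQ_PRINTF(seq, "%llu %s ...\n", cgrp->kn->id, cgrp->kn->name);
	return 0;
}
```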
Hello, Yonghong. On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote: > Maybe you can have a bpf program signature like below: > > int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp, > struct cgroup *parent_cgrp) > > parent_cgrp is NULL when cgrp is the root cgroup. > > I would like the bpf program should send the following information to > user space: > <parent cgroup dir name> <current cgroup dir name> I don't think parent cgroup dir name would be sufficient to reconstruct the path given that multiple cgroups in different subtrees can have the same name. For live cgroups, userspace can find the path from id (or ino) without traversing anything by constructing the fhandle, open it open_by_handle_at() and then reading /proc/self/fd/$FD symlink - https://lkml.org/lkml/2020/12/2/1126. This isn't available for dead cgroups but I'm not sure how much that'd matter given that they aren't visible from userspace anyway. > <various stats interested by the user> > > This way, user space can easily construct the cgroup hierarchy stat like > cpu mem cpu pressure mem pressure ... > cgroup1 ... > child1 ... > grandchild1 ... > child2 ... > cgroup 2 ... > child 3 ... > ... ... > > the bpf iterator can have additional parameter like > cgroup_id = ... to only call bpf program once with that > cgroup_id if specified. > > The kernel part of cgroup_iter can call cgroup_rstat_flush() > before calling cgroup_iter bpf program. > > WDYT? Would it work to just pass in @cgrp and provide a group of helpers so that the program can do whatever it wanna do including looking up the full path and passing that to userspace? Thanks.
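A hedged sketch of the id-to-path lookup Tejun describes (see the lkml link above): encode the cgroup id as a kernfs file handle, open it relative to an fd on the cgroup2 mount with open_by_handle_at() (which needs CAP_DAC_READ_SEARCH), and read the /proc/self/fd symlink. The FILEID_KERNFS value (0xfe) and the handle layout are assumptions based on the kernfs fhandle support referenced in that thread:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static int cgroup_path_from_id(int cgroup2_mnt_fd, uint64_t cgrp_id,
			       char *buf, size_t len)
{
	struct {
		struct file_handle fh;
		uint64_t id;		/* overlays fh.f_handle */
	} handle = {};
	char proc[64];
	ssize_t n;
	int fd;

	handle.fh.handle_bytes = sizeof(uint64_t);
	handle.fh.handle_type = 0xfe;	/* FILEID_KERNFS, assumed */
	handle.id = cgrp_id;

	fd = open_by_handle_at(cgroup2_mnt_fd, &handle.fh, O_RDONLY);
	if (fd < 0)
		return -1;

	snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
	n = readlink(proc, buf, len - 1);
	close(fd);
	if (n < 0)
		return -1;

	buf[n] = '\0';	/* e.g. /sys/fs/cgroup/foo/bar for live cgroups */
	return 0;
}
```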
Hi Tejun, On Fri, May 20, 2022 at 1:11 AM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Fri, May 20, 2022 at 12:58:52AM -0700, Yosry Ahmed wrote: > > On Fri, May 20, 2022 at 12:41 AM Tejun Heo <tj@kernel.org> wrote: > > > > > > On Fri, May 20, 2022 at 01:21:31AM +0000, Yosry Ahmed wrote: > > > > From: Hao Luo <haoluo@google.com> > > > > > > > > Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this > > > > iter doesn't iterate a set of kernel objects. Instead, it is supposed to > > > > be parameterized by a cgroup id and prints only that cgroup. So one > > > > needs to specify a target cgroup id when attaching this iter. The target > > > > cgroup's state can be read out via a link of this iter. > > > > > > > > Signed-off-by: Hao Luo <haoluo@google.com> > > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > > > > This could be me not understanding why it's structured this way but it keeps > > > bothering me that this is adding a cgroup iterator which doesn't iterate > > > cgroups. If all that's needed is extracting information from a specific > > > cgroup, why does this need to be an iterator? e.g. why can't I use > > > BPF_PROG_TEST_RUN which looks up the cgroup with the provided ID, flushes > > > rstat, retrieves whatever information necessary and returns that as the > > > result? > > > > I will let Hao and Yonghong reply here as they have a lot more > > context, and they had previous discussions about cgroup_iter. I just > > want to say that exposing the stats in a file is extremely convenient > > for userspace apps. It becomes very similar to reading stats from > > cgroupfs. It also makes migrating cgroup stats that we have > > implemented in the kernel to BPF a lot easier. > > So, if it were upto me, I'd rather direct energy towards making retrieving > information through TEST_RUN_PROG easier rather than clinging to making > kernel output text. I get that text interface is familiar but it kinda > sucks in many ways. > Tejun, could you explain more about the downside of text interfaces and why TEST_RUN_PROG would address the problems in text output? From the discussion we had last time, I understand that your concern was the unstable interface if we introduce bpf files in cgroupfs, so we are moving toward replicating the directory structure in bpffs. But I am not sure about the issue of text format output > > AFAIK there are also discussions about using overlayfs to have links > > to the bpffs files in cgroupfs, which makes it even better. So I would > > really prefer keeping the approach we have here of reading stats > > through a file from userspace. As for how we go about this (and why a > > cgroup iterator doesn't iterate cgroups) I will leave this for Hao and > > Yonghong to explain the rationale behind it. Ideally we can keep the > > same functionality under a more descriptive name/type. > > My answer would be the same here. You guys seem dead set on making the > kernel emulate cgroup1. I'm not gonna explicitly block that but would > strongly suggest having a longer term view. > The reason why Yosry and I are still pushing toward this direction is that our user space applications rely heavily on extracting information from text output for cgroups. Please understand that migrating them from the traditional model to a new model is a bigger pain. But I agree that if we have a better, concrete solution (for example, maybe TEST_RUN_PROG) to convince them and help them migrate, I really would love to contribute and work on it. 
> If you *must* do the iterator, can you at least make it a proper iterator > which supports seeking? AFAICS there's nothing fundamentally preventing bpf > iterators from supporting seeking. Or is it that you need something which is > pinned to a cgroup so that you can emulate the directory structure? > Yonghong may comment on adding seek() for bpf_iter; I would be happy to contribute if we need it. Right now, I don't think we have a use case that requires seeking. My thought: for cgroups, we could seek by cgroup id. Not every kernel object is indexable, though, so seeking may not apply to all iterators. Hao
Hi Tejun and Yonghong, On Fri, May 20, 2022 at 9:45 AM Tejun Heo <tj@kernel.org> wrote: > On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote: > > Maybe you can have a bpf program signature like below: > > > > int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp, > > struct cgroup *parent_cgrp) > > > > parent_cgrp is NULL when cgrp is the root cgroup. > > > > I would like the bpf program should send the following information to > > user space: > > <parent cgroup dir name> <current cgroup dir name> > > I don't think parent cgroup dir name would be sufficient to reconstruct the > path given that multiple cgroups in different subtrees can have the same > name. For live cgroups, userspace can find the path from id (or ino) without > traversing anything by constructing the fhandle, open it open_by_handle_at() > and then reading /proc/self/fd/$FD symlink - > https://lkml.org/lkml/2020/12/2/1126. This isn't available for dead cgroups > but I'm not sure how much that'd matter given that they aren't visible from > userspace anyway. > Sending cgroup id is better than cgroup dir name, also because IIUC the path obtained from cgroup id depends on the namespace of the userspace process. So if the dump file may be potentially read by processes within a container, it's better to have the output namespaced IMO. > > <various stats interested by the user> > > > > This way, user space can easily construct the cgroup hierarchy stat like > > cpu mem cpu pressure mem pressure ... > > cgroup1 ... > > child1 ... > > grandchild1 ... > > child2 ... > > cgroup 2 ... > > child 3 ... > > ... ... > > > > the bpf iterator can have additional parameter like > > cgroup_id = ... to only call bpf program once with that > > cgroup_id if specified. Yep, this should work. We just need to make the cgroup_id parameter optional. If it is specified when creating bpf_iter_link, we print for that cgroup only. If it is not specified, we iterate over all cgroups. If I understand correctly, sounds doable. > > The kernel part of cgroup_iter can call cgroup_rstat_flush() > > before calling cgroup_iter bpf program. Sounds good to me as well. But my knowledge on rstat_flush is limited. Yosry can give this a try. > > Would it work to just pass in @cgrp and provide a group of helpers so that > the program can do whatever it wanna do including looking up the full path > and passing that to userspace? > My understanding is, yes, doable. If we need the full path information of a cgroup, helpers or kfuncs are needed. The userspace needs to specify the identity of the cgroup, when creating bpf_iter. This identity could be cgroup id or fd. This identity needs to be converted to cgroup object somewhere before passing into bpf program to use.
On Fri, May 20, 2022 at 12:43 PM Hao Luo <haoluo@google.com> wrote: > > Hi Tejun and Yonghong, > > On Fri, May 20, 2022 at 9:45 AM Tejun Heo <tj@kernel.org> wrote: > > On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote: > > > Maybe you can have a bpf program signature like below: > > > > > > int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp, > > > struct cgroup *parent_cgrp) > > > > > > parent_cgrp is NULL when cgrp is the root cgroup. > > > > > > I would like the bpf program should send the following information to > > > user space: > > > <parent cgroup dir name> <current cgroup dir name> > > > > I don't think parent cgroup dir name would be sufficient to reconstruct the > > path given that multiple cgroups in different subtrees can have the same > > name. For live cgroups, userspace can find the path from id (or ino) without > > traversing anything by constructing the fhandle, open it open_by_handle_at() > > and then reading /proc/self/fd/$FD symlink - > > https://lkml.org/lkml/2020/12/2/1126. This isn't available for dead cgroups > > but I'm not sure how much that'd matter given that they aren't visible from > > userspace anyway. > > > > Sending cgroup id is better than cgroup dir name, also because IIUC > the path obtained from cgroup id depends on the namespace of the > userspace process. So if the dump file may be potentially read by > processes within a container, it's better to have the output > namespaced IMO. > > > > <various stats interested by the user> > > > > > > This way, user space can easily construct the cgroup hierarchy stat like > > > cpu mem cpu pressure mem pressure ... > > > cgroup1 ... > > > child1 ... > > > grandchild1 ... > > > child2 ... > > > cgroup 2 ... > > > child 3 ... > > > ... ... > > > > > > the bpf iterator can have additional parameter like > > > cgroup_id = ... to only call bpf program once with that > > > cgroup_id if specified. > > Yep, this should work. We just need to make the cgroup_id parameter > optional. If it is specified when creating bpf_iter_link, we print for > that cgroup only. If it is not specified, we iterate over all cgroups. > If I understand correctly, sounds doable. > > > > The kernel part of cgroup_iter can call cgroup_rstat_flush() > > > before calling cgroup_iter bpf program. > > Sounds good to me as well. But my knowledge on rstat_flush is limited. > Yosry can give this a try. > > > > > Would it work to just pass in @cgrp and provide a group of helpers so that > > the program can do whatever it wanna do including looking up the full path > > and passing that to userspace? > > > > My understanding is, yes, doable. If we need the full path information > of a cgroup, helpers or kfuncs are needed. > > The userspace needs to specify the identity of the cgroup, when > creating bpf_iter. This identity could be cgroup id or fd. This > identity needs to be converted to cgroup object somewhere before > passing into bpf program to use. Let's sum up the discussion here, I feel like we are losing track of the main problem. IIUC the main concern is that cgroup_iter is not effectively an iterator, it rather dumps information for one cgroup. I like the suggestion to make it iterate cgroups by default, and an optional cgroup_id parameter to make it only "iterate" this one cgroup. IIUC, this cgroup_id parameter would be a link parameter, similar to the current approach. Basically, we extend the current patch so that if cgroup_id is not specified the iterator gets called for all cgroups instead of one. 
This fixes the problem for our use case and also keeps cgroup_iter generic enough. Is my understanding correct? If yes, I don't see a need to flush rstat in the kernel on behalf of cgroup_iter progs.
Hi Tejun and Yonghong, On Fri, May 20, 2022 at 12:42 PM Hao Luo <haoluo@google.com> wrote: > > Hi Tejun and Yonghong, > > On Fri, May 20, 2022 at 9:45 AM Tejun Heo <tj@kernel.org> wrote: > > On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote: > > > <various stats interested by the user> > > > > > > This way, user space can easily construct the cgroup hierarchy stat like > > > cpu mem cpu pressure mem pressure ... > > > cgroup1 ... > > > child1 ... > > > grandchild1 ... > > > child2 ... > > > cgroup 2 ... > > > child 3 ... > > > ... ... > > > > > > the bpf iterator can have additional parameter like > > > cgroup_id = ... to only call bpf program once with that > > > cgroup_id if specified. > > Yep, this should work. We just need to make the cgroup_id parameter > optional. If it is specified when creating bpf_iter_link, we print for > that cgroup only. If it is not specified, we iterate over all cgroups. > If I understand correctly, sounds doable. > Yonghong, I realized that seek() which Tejun has been calling out, can be used to specify the target cgroup, rather than adding a new parameter. Maybe, we can pass cgroup_id to seek() on cgroup bpf_iter, which will instruct read() to return the corresponding cgroup's stats. On the other hand, reading without calling seek() beforehand will return all the cgroups. WDYT?
On Fri, May 20, 2022 at 02:18:42PM -0700, Yosry Ahmed wrote: > > > > The userspace needs to specify the identity of the cgroup, when > > creating bpf_iter. This identity could be cgroup id or fd. This > > identity needs to be converted to cgroup object somewhere before > > passing into bpf program to use. > > > Let's sum up the discussion here, I feel like we are losing track of > the main problem. IIUC the main concern is that cgroup_iter is not > effectively an iterator, it rather dumps information for one cgroup. I > like the suggestion to make it iterate cgroups by default, and an > optional cgroup_id parameter to make it only "iterate" this one > cgroup. We have bpf_map iterator that walks all bpf maps. When map iterator is parametrized with map_fd the iterator walks all elements of that map. cgroup iterator should have similar semantics. When non-parameterized it will walk all cgroups and their descendent depth first way. I believe that's what Yonghong is proposing. When parametrized it will start from that particular cgroup and walk all descendant of that cgroup only. The bpf prog can stop the iteration right away with ret 1. Maybe we can add two parameters. One -> cgroup_fd to use and another -> the order of iteration css_for_each_descendant_pre vs _post. wdyt?
On Fri, May 20, 2022 at 3:19 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Fri, May 20, 2022 at 02:18:42PM -0700, Yosry Ahmed wrote: > > > > > > The userspace needs to specify the identity of the cgroup, when > > > creating bpf_iter. This identity could be cgroup id or fd. This > > > identity needs to be converted to cgroup object somewhere before > > > passing into bpf program to use. > > > > > > Let's sum up the discussion here, I feel like we are losing track of > > the main problem. IIUC the main concern is that cgroup_iter is not > > effectively an iterator, it rather dumps information for one cgroup. I > > like the suggestion to make it iterate cgroups by default, and an > > optional cgroup_id parameter to make it only "iterate" this one > > cgroup. > > We have bpf_map iterator that walks all bpf maps. > When map iterator is parametrized with map_fd the iterator walks > all elements of that map. > cgroup iterator should have similar semantics. > When non-parameterized it will walk all cgroups and their descendent > depth first way. I believe that's what Yonghong is proposing. > When parametrized it will start from that particular cgroup and > walk all descendant of that cgroup only. > The bpf prog can stop the iteration right away with ret 1. > Maybe we can add two parameters. One -> cgroup_fd to use and another -> > the order of iteration css_for_each_descendant_pre vs _post. > wdyt? So basically extend the current patch so that cgroup_id (or cgroup_fd) is optional, and it specifies where the iteration starts. If not provided, then we start at root. For our use case where we want the iterator to only be invoked for one cgroup we make it return 1 to stop after the first iteration. I assume an order parameter is also needed to specify "pre" for our use case to make sure we are starting iteration at the top cgroup (the one whose cgroup_id is the parameter of the iterator). Is my understanding correct? If yes, then this sounds very good. It is generic enough, actually iterates cgroups, and works for our use case.
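To make the single-cgroup case concrete under these semantics: with the walk parameterized to start at the target cgroup, the program can simply bail out after the first invocation, per the "ret 1" behaviour Alexei describes. A minimal sketch, reusing the bpf_iter__cgroup context from the patch below (stat printing elided):

```c
SEC("iter/cgroup")
int dump_one_cgroup(struct bpf_iter__cgroup *ctx)
{
	struct cgroup *cgrp = ctx->cgroup;

	if (!cgrp)
		return 0;

	/* ... BPF_SEQ_PRINTF() the stats for the starting cgroup ... */

	return 1;	/* stop the walk: descendants are not visited */
}
```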
Hello, On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote: > We have bpf_map iterator that walks all bpf maps. > When map iterator is parametrized with map_fd the iterator walks > all elements of that map. > cgroup iterator should have similar semantics. > When non-parameterized it will walk all cgroups and their descendent > depth first way. I believe that's what Yonghong is proposing. > When parametrized it will start from that particular cgroup and > walk all descendant of that cgroup only. > The bpf prog can stop the iteration right away with ret 1. > Maybe we can add two parameters. One -> cgroup_fd to use and another -> > the order of iteration css_for_each_descendant_pre vs _post. > wdyt? Sounds perfectly reasonable to me. Thanks.
On 5/20/22 9:45 AM, Tejun Heo wrote: > Hello, Yonghong. > > On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote: >> Maybe you can have a bpf program signature like below: >> >> int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp, >> struct cgroup *parent_cgrp) >> >> parent_cgrp is NULL when cgrp is the root cgroup. >> >> I would like the bpf program should send the following information to >> user space: >> <parent cgroup dir name> <current cgroup dir name> > > I don't think parent cgroup dir name would be sufficient to reconstruct the > path given that multiple cgroups in different subtrees can have the same > name. For live cgroups, userspace can find the path from id (or ino) without > traversing anything by constructing the fhandle, open it open_by_handle_at() > and then reading /proc/self/fd/$FD symlink - > https://lkml.org/lkml/2020/12/2/1126. This isn't available for dead cgroups > but I'm not sure how much that'd matter given that they aren't visible from > userspace anyway. passing id/ino to user space and then get directory name in userspace should work just fine. > >> <various stats interested by the user> >> >> This way, user space can easily construct the cgroup hierarchy stat like >> cpu mem cpu pressure mem pressure ... >> cgroup1 ... >> child1 ... >> grandchild1 ... >> child2 ... >> cgroup 2 ... >> child 3 ... >> ... ... >> >> the bpf iterator can have additional parameter like >> cgroup_id = ... to only call bpf program once with that >> cgroup_id if specified. >> >> The kernel part of cgroup_iter can call cgroup_rstat_flush() >> before calling cgroup_iter bpf program. >> >> WDYT? > > Would it work to just pass in @cgrp and provide a group of helpers so that > the program can do whatever it wanna do including looking up the full path > and passing that to userspace? I am not super familiar with cgroup internals, I guess with cgroup + helpers to retrieve stats, or directly expose stats data structure to bpf program. Either one is okay to me as long as we can get desired results. > > Thanks. >
On 5/20/22 2:49 PM, Hao Luo wrote: > Hi Tejun and Yonghong, > > On Fri, May 20, 2022 at 12:42 PM Hao Luo <haoluo@google.com> wrote: >> >> Hi Tejun and Yonghong, >> >> On Fri, May 20, 2022 at 9:45 AM Tejun Heo <tj@kernel.org> wrote: >>> On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote: >>>> <various stats interested by the user> >>>> >>>> This way, user space can easily construct the cgroup hierarchy stat like >>>> cpu mem cpu pressure mem pressure ... >>>> cgroup1 ... >>>> child1 ... >>>> grandchild1 ... >>>> child2 ... >>>> cgroup 2 ... >>>> child 3 ... >>>> ... ... >>>> >>>> the bpf iterator can have additional parameter like >>>> cgroup_id = ... to only call bpf program once with that >>>> cgroup_id if specified. >> >> Yep, this should work. We just need to make the cgroup_id parameter >> optional. If it is specified when creating bpf_iter_link, we print for >> that cgroup only. If it is not specified, we iterate over all cgroups. >> If I understand correctly, sounds doable. >> > > Yonghong, I realized that seek() which Tejun has been calling out, can > be used to specify the target cgroup, rather than adding a new > parameter. Maybe, we can pass cgroup_id to seek() on cgroup bpf_iter, > which will instruct read() to return the corresponding cgroup's stats. > On the other hand, reading without calling seek() beforehand will > return all the cgroups. Currently, seek is not supported for bpf_iter. const struct file_operations bpf_iter_fops = { .open = iter_open, .llseek = no_llseek, .read = bpf_seq_read, .release = iter_release, }; But if seek() works, I don't mind to remove this restriction. But not sure what to seek. Do you mean to provide a cgroup_fd/cgroup_id as the seek() syscall parameter? This may work. But considering we have parameterized example (map_fd) and in the future, we may have other parameterized bpf_iter (e.g., for one task). Maybe parameter-based approach is better. > > WDYT?
On 5/20/22 3:57 PM, Tejun Heo wrote: > Hello, > > On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote: >> We have bpf_map iterator that walks all bpf maps. >> When map iterator is parametrized with map_fd the iterator walks >> all elements of that map. >> cgroup iterator should have similar semantics. >> When non-parameterized it will walk all cgroups and their descendent >> depth first way. I believe that's what Yonghong is proposing. >> When parametrized it will start from that particular cgroup and >> walk all descendant of that cgroup only. >> The bpf prog can stop the iteration right away with ret 1. >> Maybe we can add two parameters. One -> cgroup_fd to use and another -> >> the order of iteration css_for_each_descendant_pre vs _post. >> wdyt? > > Sounds perfectly reasonable to me. This works for me too. Thanks! > > Thanks. >
On Fri, May 20, 2022 at 5:59 PM Yonghong Song <yhs@fb.com> wrote: > On 5/20/22 3:57 PM, Tejun Heo wrote: > > Hello, > > > > On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote: > >> We have bpf_map iterator that walks all bpf maps. > >> When map iterator is parametrized with map_fd the iterator walks > >> all elements of that map. > >> cgroup iterator should have similar semantics. > >> When non-parameterized it will walk all cgroups and their descendent > >> depth first way. I believe that's what Yonghong is proposing. > >> When parametrized it will start from that particular cgroup and > >> walk all descendant of that cgroup only. > >> The bpf prog can stop the iteration right away with ret 1. > >> Maybe we can add two parameters. One -> cgroup_fd to use and another -> > >> the order of iteration css_for_each_descendant_pre vs _post. > >> wdyt? > > > > Sounds perfectly reasonable to me. > > This works for me too. Thanks! > This sounds good to me. Thanks. Let's try to do it in the next iteration.
On Fri, May 20, 2022 at 5:58 PM Yonghong Song <yhs@fb.com> wrote: > On 5/20/22 2:49 PM, Hao Luo wrote: > > Hi Tejun and Yonghong, > > > > On Fri, May 20, 2022 at 12:42 PM Hao Luo <haoluo@google.com> wrote: > >> > >> Hi Tejun and Yonghong, > >> > >> On Fri, May 20, 2022 at 9:45 AM Tejun Heo <tj@kernel.org> wrote: > >>> On Fri, May 20, 2022 at 09:29:43AM -0700, Yonghong Song wrote: > >>>> <various stats interested by the user> > >>>> > >>>> This way, user space can easily construct the cgroup hierarchy stat like > >>>> cpu mem cpu pressure mem pressure ... > >>>> cgroup1 ... > >>>> child1 ... > >>>> grandchild1 ... > >>>> child2 ... > >>>> cgroup 2 ... > >>>> child 3 ... > >>>> ... ... > >>>> > >>>> the bpf iterator can have additional parameter like > >>>> cgroup_id = ... to only call bpf program once with that > >>>> cgroup_id if specified. > >> > >> Yep, this should work. We just need to make the cgroup_id parameter > >> optional. If it is specified when creating bpf_iter_link, we print for > >> that cgroup only. If it is not specified, we iterate over all cgroups. > >> If I understand correctly, sounds doable. > >> > > > > Yonghong, I realized that seek() which Tejun has been calling out, can > > be used to specify the target cgroup, rather than adding a new > > parameter. Maybe, we can pass cgroup_id to seek() on cgroup bpf_iter, > > which will instruct read() to return the corresponding cgroup's stats. > > On the other hand, reading without calling seek() beforehand will > > return all the cgroups. > > Currently, seek is not supported for bpf_iter. > > const struct file_operations bpf_iter_fops = { > .open = iter_open, > .llseek = no_llseek, > .read = bpf_seq_read, > .release = iter_release, > }; > > But if seek() works, I don't mind to remove this restriction. > But not sure what to seek. Do you mean to provide a cgroup_fd/cgroup_id > as the seek() syscall parameter? This may work. Yes, passing a cgroup_id as the seek() syscall parameter was what I meant. Tejun previously requested us to support seek() for a proper iterator. Since Alexei has a nice solution that all of us have ack'ed, I am not sure whether we still want to add seek() for bpf_iter as Tejun asked. I guess not. > > But considering we have parameterized example (map_fd) and > in the future, we may have other parameterized bpf_iter > (e.g., for one task). Maybe parameter-based approach is better. > Acknowledged. > > > > WDYT?
On Fri, May 20, 2022 at 07:43:12PM -0700, Hao Luo wrote: > Yes, passing a cgroup_id as the seek() syscall parameter was what I meant. > > Tejun previously requested us to support seek() for a proper iterator. > Since Alexei has a nice solution that all of us have ack'ed, I am not > sure whether we still want to add seek() for bpf_iter as Tejun asked. > I guess not. Yeah, I meant seeking with the ID but it's better to follow the same convention as other iterators. Thanks.
On Fri, May 20, 2022 at 7:35 PM Hao Luo <haoluo@google.com> wrote: > > On Fri, May 20, 2022 at 5:59 PM Yonghong Song <yhs@fb.com> wrote: > > On 5/20/22 3:57 PM, Tejun Heo wrote: > > > Hello, > > > > > > On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote: > > >> We have bpf_map iterator that walks all bpf maps. > > >> When map iterator is parametrized with map_fd the iterator walks > > >> all elements of that map. > > >> cgroup iterator should have similar semantics. > > >> When non-parameterized it will walk all cgroups and their descendent > > >> depth first way. I believe that's what Yonghong is proposing. > > >> When parametrized it will start from that particular cgroup and > > >> walk all descendant of that cgroup only. > > >> The bpf prog can stop the iteration right away with ret 1. > > >> Maybe we can add two parameters. One -> cgroup_fd to use and another -> > > >> the order of iteration css_for_each_descendant_pre vs _post. > > >> wdyt? > > > > > > Sounds perfectly reasonable to me. > > > > This works for me too. Thanks! > > > > This sounds good to me. Thanks. Let's try to do it in the next iteration. Can we, in addition to descendant_pre and descendant_post walk algorithms also add the one that does ascendants walk (i.e., start from specified cgroup and walk up to the root cgroup)? I don't have specific example, but it seems natural to include it for "cgroup iterator" in general. Hopefully it won't add much code to the implementation.
On Mon, May 23, 2022 at 4:58 PM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Fri, May 20, 2022 at 7:35 PM Hao Luo <haoluo@google.com> wrote: > > > > On Fri, May 20, 2022 at 5:59 PM Yonghong Song <yhs@fb.com> wrote: > > > On 5/20/22 3:57 PM, Tejun Heo wrote: > > > > Hello, > > > > > > > > On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote: > > > >> We have bpf_map iterator that walks all bpf maps. > > > >> When map iterator is parametrized with map_fd the iterator walks > > > >> all elements of that map. > > > >> cgroup iterator should have similar semantics. > > > >> When non-parameterized it will walk all cgroups and their descendent > > > >> depth first way. I believe that's what Yonghong is proposing. > > > >> When parametrized it will start from that particular cgroup and > > > >> walk all descendant of that cgroup only. > > > >> The bpf prog can stop the iteration right away with ret 1. > > > >> Maybe we can add two parameters. One -> cgroup_fd to use and another -> > > > >> the order of iteration css_for_each_descendant_pre vs _post. > > > >> wdyt? > > > > > > > > Sounds perfectly reasonable to me. > > > > > > This works for me too. Thanks! > > > > > > > This sounds good to me. Thanks. Let's try to do it in the next iteration. > > Can we, in addition to descendant_pre and descendant_post walk > algorithms also add the one that does ascendants walk (i.e., start > from specified cgroup and walk up to the root cgroup)? I don't have > specific example, but it seems natural to include it for "cgroup > iterator" in general. Hopefully it won't add much code to the > implementation. Yep. Sounds reasonable and doable. It's just adding a flag to specify traversal order, like: { WALK_DESCENDANT_PRE, WALK_DESCENDANT_POST, WALK_PARENT_UP, }; In bpf_iter's seq_next(), change the algorithm to yield the parent of the current cgroup.
On Mon, May 23, 2022 at 5:53 PM Hao Luo <haoluo@google.com> wrote: > > On Mon, May 23, 2022 at 4:58 PM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > On Fri, May 20, 2022 at 7:35 PM Hao Luo <haoluo@google.com> wrote: > > > > > > On Fri, May 20, 2022 at 5:59 PM Yonghong Song <yhs@fb.com> wrote: > > > > On 5/20/22 3:57 PM, Tejun Heo wrote: > > > > > Hello, > > > > > > > > > > On Fri, May 20, 2022 at 03:19:19PM -0700, Alexei Starovoitov wrote: > > > > >> We have bpf_map iterator that walks all bpf maps. > > > > >> When map iterator is parametrized with map_fd the iterator walks > > > > >> all elements of that map. > > > > >> cgroup iterator should have similar semantics. > > > > >> When non-parameterized it will walk all cgroups and their descendent > > > > >> depth first way. I believe that's what Yonghong is proposing. > > > > >> When parametrized it will start from that particular cgroup and > > > > >> walk all descendant of that cgroup only. > > > > >> The bpf prog can stop the iteration right away with ret 1. > > > > >> Maybe we can add two parameters. One -> cgroup_fd to use and another -> > > > > >> the order of iteration css_for_each_descendant_pre vs _post. > > > > >> wdyt? > > > > > > > > > > Sounds perfectly reasonable to me. > > > > > > > > This works for me too. Thanks! > > > > > > > > > > This sounds good to me. Thanks. Let's try to do it in the next iteration. > > > > Can we, in addition to descendant_pre and descendant_post walk > > algorithms also add the one that does ascendants walk (i.e., start > > from specified cgroup and walk up to the root cgroup)? I don't have > > specific example, but it seems natural to include it for "cgroup > > iterator" in general. Hopefully it won't add much code to the > > implementation. > > Yep. Sounds reasonable and doable. It's just adding a flag to specify > traversal order, like: > > { > WALK_DESCENDANT_PRE, > WALK_DESCENDANT_POST, > WALK_PARENT_UP, Probably something more like BPF_CG_WALK_DESCENDANT_PRE and so on? > }; > > In bpf_iter's seq_next(), change the algorithm to yield the parent of > the current cgroup.
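Putting Alexei's two parameters and Andrii's ancestor walk together, a rough sketch of what the uapi could look like in a later revision, extending the bpf_iter_link_info hunk in the patch below; names, values, and layout are placeholders, not agreed-upon uapi:

```c
/* Hypothetical traversal orders for a parameterized cgroup iterator. */
enum bpf_iter_cgroup_walk_order {
	BPF_CG_WALK_DESCENDANTS_PRE,	/* css_for_each_descendant_pre() */
	BPF_CG_WALK_DESCENDANTS_POST,	/* css_for_each_descendant_post() */
	BPF_CG_WALK_ANCESTORS_UP,	/* from the cgroup up to the root */
};

union bpf_iter_link_info {
	struct {
		__u32	map_fd;
	} map;
	struct {
		__u32	cgroup_fd;	/* 0: start the walk at the root */
		__u32	order;		/* enum bpf_iter_cgroup_walk_order */
	} cgroup;
};
```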
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index c107392b0ba7..74c30fe20c23 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -44,6 +44,7 @@ struct kobject; struct mem_cgroup; struct module; struct bpf_func_state; +struct cgroup; extern struct idr btf_idr; extern spinlock_t btf_idr_lock; @@ -1581,6 +1582,7 @@ int bpf_obj_get_user(const char __user *pathname, int flags); struct bpf_iter_aux_info { struct bpf_map *map; + struct cgroup *cgroup; }; typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog, diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 0210f85131b3..e5bc40d4bccc 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -91,6 +91,9 @@ union bpf_iter_link_info { struct { __u32 map_fd; } map; + struct { + __u64 cgroup_id; + } cgroup; }; /* BPF syscall commands, see bpf(2) man-page for more details. */ @@ -5965,6 +5968,9 @@ struct bpf_link_info { struct { __u32 map_id; } map; + struct { + __u64 cgroup_id; + } cgroup; }; } iter; struct { diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index 057ba8e01e70..3e563b163d49 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -36,6 +36,9 @@ obj-$(CONFIG_BPF_SYSCALL) += bpf_struct_ops.o obj-${CONFIG_BPF_LSM} += bpf_lsm.o endif obj-$(CONFIG_BPF_PRELOAD) += preload/ +ifeq ($(CONFIG_CGROUPS),y) +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o +endif obj-$(CONFIG_BPF_SYSCALL) += relo_core.o $(obj)/relo_core.o: $(srctree)/tools/lib/bpf/relo_core.c FORCE diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c new file mode 100644 index 000000000000..86bdfe135d24 --- /dev/null +++ b/kernel/bpf/cgroup_iter.c @@ -0,0 +1,148 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Copyright (c) 2022 Google */ +#include <linux/bpf.h> +#include <linux/btf_ids.h> +#include <linux/cgroup.h> +#include <linux/kernel.h> +#include <linux/seq_file.h> + +struct bpf_iter__cgroup { + __bpf_md_ptr(struct bpf_iter_meta *, meta); + __bpf_md_ptr(struct cgroup *, cgroup); +}; + +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos) +{ + /* Only one session is supported. 
*/ + if (*pos > 0) + return NULL; + + if (*pos == 0) + ++*pos; + + return *(struct cgroup **)seq->private; +} + +static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos) +{ + ++*pos; + return NULL; +} + +static int cgroup_iter_seq_show(struct seq_file *seq, void *v) +{ + struct bpf_iter__cgroup ctx; + struct bpf_iter_meta meta; + struct bpf_prog *prog; + int ret = 0; + + ctx.meta = &meta; + ctx.cgroup = v; + meta.seq = seq; + prog = bpf_iter_get_info(&meta, false); + if (prog) + ret = bpf_iter_run_prog(prog, &ctx); + + return ret; +} + +static void cgroup_iter_seq_stop(struct seq_file *seq, void *v) +{ +} + +static const struct seq_operations cgroup_iter_seq_ops = { + .start = cgroup_iter_seq_start, + .next = cgroup_iter_seq_next, + .stop = cgroup_iter_seq_stop, + .show = cgroup_iter_seq_show, +}; + +BTF_ID_LIST_SINGLE(bpf_cgroup_btf_id, struct, cgroup) + +static int cgroup_iter_seq_init(void *priv_data, struct bpf_iter_aux_info *aux) +{ + *(struct cgroup **)priv_data = aux->cgroup; + return 0; +} + +static const struct bpf_iter_seq_info cgroup_iter_seq_info = { + .seq_ops = &cgroup_iter_seq_ops, + .init_seq_private = cgroup_iter_seq_init, + .seq_priv_size = sizeof(struct cgroup *), +}; + +static int bpf_iter_attach_cgroup(struct bpf_prog *prog, + union bpf_iter_link_info *linfo, + struct bpf_iter_aux_info *aux) +{ + struct cgroup *cgroup; + + cgroup = cgroup_get_from_id(linfo->cgroup.cgroup_id); + if (!cgroup) + return -EBUSY; + + aux->cgroup = cgroup; + return 0; +} + +static void bpf_iter_detach_cgroup(struct bpf_iter_aux_info *aux) +{ + if (aux->cgroup) + cgroup_put(aux->cgroup); +} + +static void bpf_iter_cgroup_show_fdinfo(const struct bpf_iter_aux_info *aux, + struct seq_file *seq) +{ + char *buf; + + seq_printf(seq, "cgroup_id:\t%llu\n", cgroup_id(aux->cgroup)); + + buf = kmalloc(PATH_MAX, GFP_KERNEL); + if (!buf) { + seq_puts(seq, "cgroup_path:\n"); + return; + } + + /* If cgroup_path_ns() fails, buf will be an empty string, cgroup_path + * will print nothing. + * + * Cgroup_path is the path in the calliing process's cgroup namespace. + */ + cgroup_path_ns(aux->cgroup, buf, sizeof(buf), + current->nsproxy->cgroup_ns); + seq_printf(seq, "cgroup_path:\t%s\n", buf); + kfree(buf); +} + +static int bpf_iter_cgroup_fill_link_info(const struct bpf_iter_aux_info *aux, + struct bpf_link_info *info) +{ + info->iter.cgroup.cgroup_id = cgroup_id(aux->cgroup); + return 0; +} + +DEFINE_BPF_ITER_FUNC(cgroup, struct bpf_iter_meta *meta, + struct cgroup *cgroup) + +static struct bpf_iter_reg bpf_cgroup_reg_info = { + .target = "cgroup", + .attach_target = bpf_iter_attach_cgroup, + .detach_target = bpf_iter_detach_cgroup, + .show_fdinfo = bpf_iter_cgroup_show_fdinfo, + .fill_link_info = bpf_iter_cgroup_fill_link_info, + .ctx_arg_info_size = 1, + .ctx_arg_info = { + { offsetof(struct bpf_iter__cgroup, cgroup), + PTR_TO_BTF_ID }, + }, + .seq_info = &cgroup_iter_seq_info, +}; + +static int __init bpf_cgroup_iter_init(void) +{ + bpf_cgroup_reg_info.ctx_arg_info[0].btf_id = bpf_cgroup_btf_id[0]; + return bpf_iter_reg_target(&bpf_cgroup_reg_info); +} + +late_initcall(bpf_cgroup_iter_init); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 0210f85131b3..e5bc40d4bccc 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -91,6 +91,9 @@ union bpf_iter_link_info { struct { __u32 map_fd; } map; + struct { + __u64 cgroup_id; + } cgroup; }; /* BPF syscall commands, see bpf(2) man-page for more details. 
*/ @@ -5965,6 +5968,9 @@ struct bpf_link_info { struct { __u32 map_id; } map; + struct { + __u64 cgroup_id; + } cgroup; }; } iter; struct {
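For reference, a sketch of how userspace could attach and pin the iterator exactly as this patch defines it, so the stats become readable as a file in bpffs. It assumes a libbpf skeleton named cgroup_iter.skel.h with a program called dump_cgroup; both names are illustrative:

```c
#include <errno.h>
#include <linux/bpf.h>
#include <bpf/libbpf.h>
#include "cgroup_iter.skel.h"	/* hypothetical skeleton */

int pin_cgroup_dump(__u64 cgroup_id, const char *pin_path)
{
	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
	union bpf_iter_link_info linfo = {};
	struct cgroup_iter *skel;
	struct bpf_link *link;

	skel = cgroup_iter__open_and_load();
	if (!skel)
		return -errno;

	/* cgroup.cgroup_id is the bpf_iter_link_info field added by this patch. */
	linfo.cgroup.cgroup_id = cgroup_id;
	opts.link_info = &linfo;
	opts.link_info_len = sizeof(linfo);

	link = bpf_program__attach_iter(skel->progs.dump_cgroup, &opts);
	if (!link)
		return -errno;

	/* Reading the pinned file (e.g. cat <pin_path>) runs the program for
	 * the target cgroup, much like reading a stat file in cgroupfs. */
	return bpf_link__pin(link, pin_path);
}
```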