
[RFC,bpf-next,1/5] cgroup: Enable task_under_cgroup_hierarchy() on cgroup1

Message ID 20230903142800.3870-2-laoar.shao@gmail.com (mailing list archive)
State RFC
Delegated to: BPF
Series bpf, cgroup: Enable cgroup_array map on cgroup1

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-0 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-1 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-5 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-6 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-10 fail Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-11 fail Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 fail Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 fail Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-14 fail Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-15 fail Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 fail Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 fail Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-28 success Logs for veristat
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 10577 this patch: 10577
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 2855 this patch: 2855
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 11463 this patch: 11463
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 36 lines checked
netdev/kdoc success Errors and warnings before: 3 this patch: 3
netdev/source_inline success Was 0 now: 0

Commit Message

Yafang Shao Sept. 3, 2023, 2:27 p.m. UTC
Currently, the function task_under_cgroup_hierarchy() can only determine
whether a task resides within a cgroup2 hierarchy. Given the continued
prevalence of cgroup1, it is useful to make a minor adjustment that
extends this functionality to cgroup1 as well. With this change in
place, we can easily verify a task's cgroup membership from within BPF
programs; for instance, we can check whether a task belongs to a cgroup1
directory such as /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/ or
/sys/fs/cgroup/cpu,cpuacct/kubepods/besteffort/.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/cgroup.h | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)
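
For illustration, a minimal BPF-side sketch of the use case the commit
message describes: gating a tracing program on membership in a cgroup
(with this series, potentially a cgroup1 directory) held in a
BPF_MAP_TYPE_CGROUP_ARRAY map. The map and program names are
hypothetical, and slot 0 is assumed to be populated from userspace with
a cgroup directory fd.

// SPDX-License-Identifier: GPL-2.0
/* Sketch only: names (cgroup_map, on_switch) are illustrative. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_CGROUP_ARRAY);
	__uint(max_entries, 1);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} cgroup_map SEC(".maps");

SEC("tracepoint/sched/sched_switch")
int on_switch(void *ctx)
{
	/* Returns 1 when the current task is under the cgroup stored at
	 * slot 0; with this series that cgroup may be a cgroup1 one. */
	if (bpf_current_task_under_cgroup(&cgroup_map, 0) != 1)
		return 0;
	/* ... current task is inside the watched cgroup ... */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";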

Comments

Alexei Starovoitov Sept. 6, 2023, 7:53 p.m. UTC | #1
On Sun, Sep 03, 2023 at 02:27:56PM +0000, Yafang Shao wrote:
> Currently, the function task_under_cgroup_hierarchy() can only determine
> whether a task resides within a cgroup2 hierarchy. Given the continued
> prevalence of cgroup1, it is useful to make a minor adjustment that
> extends this functionality to cgroup1 as well. With this change in
> place, we can easily verify a task's cgroup membership from within BPF
> programs; for instance, we can check whether a task belongs to a cgroup1
> directory such as /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/ or
> /sys/fs/cgroup/cpu,cpuacct/kubepods/besteffort/.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  include/linux/cgroup.h | 24 +++++++++++++++++++++---
>  1 file changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index b307013..5414a2c 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -543,15 +543,33 @@ static inline struct cgroup *cgroup_ancestor(struct cgroup *cgrp,
>   * @ancestor: possible ancestor of @task's cgroup
>   *
>   * Tests whether @task's default cgroup hierarchy is a descendant of @ancestor.
> - * It follows all the same rules as cgroup_is_descendant, and only applies
> - * to the default hierarchy.
> + * It follows all the same rules as cgroup_is_descendant.
>   */
>  static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
>  					       struct cgroup *ancestor)
>  {
>  	struct css_set *cset = task_css_set(task);
> +	struct cgroup *cgrp;
> +	bool ret = false;
> +	int ssid;
> +
> +	if (ancestor->root == &cgrp_dfl_root)
> +		return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
> +
> +	for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++) {
> +		if (!ancestor->subsys[ssid])
> +			continue;

This looks wrong. I believe cgroup_mutex should be held to iterate.

Tejun ?

>  
> -	return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
> +		cgrp = task_css(task, ssid)->cgroup;
> +		if (!cgrp)
> +			continue;
> +
> +		if (!cgroup_is_descendant(cgrp, ancestor))
> +			return false;
> +		if (!ret)
> +			ret = true;
> +	}
> +	return ret;
>  }
>  
>  /* no synchronization, the result can only be used as a hint */
> -- 
> 1.8.3.1
>
Tejun Heo Sept. 6, 2023, 8:13 p.m. UTC | #2
Hello,

On Sun, Sep 03, 2023 at 02:27:56PM +0000, Yafang Shao wrote:
>  static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
>  					       struct cgroup *ancestor)
>  {
>  	struct css_set *cset = task_css_set(task);
> +	struct cgroup *cgrp;
> +	bool ret = false;
> +	int ssid;
> +
> +	if (ancestor->root == &cgrp_dfl_root)
> +		return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
> +
> +	for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++) {
> +		if (!ancestor->subsys[ssid])
> +			continue;
>  
> -	return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
> +		cgrp = task_css(task, ssid)->cgroup;
> +		if (!cgrp)
> +			continue;
> +
> +		if (!cgroup_is_descendant(cgrp, ancestor))
> +			return false;
> +		if (!ret)
> +			ret = true;
> +	}
> +	return ret;

I feel ambivalent about adding support for this in cgroup1, especially
given that this can only work with the fd-based interface, which is
worse than the ID-based ones. Even if we're doing this, the above is
definitely not what we want to do, as it won't work for controller-less
hierarchies like the one that systemd used to use. You'd have to lock
css_set_lock and walk the cgrp_cset_links.

Thanks.
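
For reference only: a rough sketch of the cgrp_cset_link walk suggested
above. Since css_set_lock and struct cgrp_cset_link are private to
kernel/cgroup/cgroup-internal.h, such a helper could no longer live as a
static inline in include/linux/cgroup.h; the function name here is made
up, and this is not what the posted patch does.

/* Hypothetical v1-aware variant that walks every hierarchy the task's
 * css_set is linked into, including controller-less ones. */
static bool task_under_cgroup_hierarchy_v1(struct css_set *cset,
					   struct cgroup *ancestor)
{
	struct cgrp_cset_link *link;
	bool ret = false;

	spin_lock_irq(&css_set_lock);
	list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
		/* Only the cgroup on @ancestor's hierarchy matters. */
		if (link->cgrp->root != ancestor->root)
			continue;
		ret = cgroup_is_descendant(link->cgrp, ancestor);
		break;
	}
	spin_unlock_irq(&css_set_lock);
	return ret;
}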
Yafang Shao Sept. 7, 2023, 3:05 a.m. UTC | #3
On Thu, Sep 7, 2023 at 4:13 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Sun, Sep 03, 2023 at 02:27:56PM +0000, Yafang Shao wrote:
> >  static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
> >                                              struct cgroup *ancestor)
> >  {
> >       struct css_set *cset = task_css_set(task);
> > +     struct cgroup *cgrp;
> > +     bool ret = false;
> > +     int ssid;
> > +
> > +     if (ancestor->root == &cgrp_dfl_root)
> > +             return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
> > +
> > +     for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++) {
> > +             if (!ancestor->subsys[ssid])
> > +                     continue;
> >
> > -     return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
> > +             cgrp = task_css(task, ssid)->cgroup;
> > +             if (!cgrp)
> > +                     continue;
> > +
> > +             if (!cgroup_is_descendant(cgrp, ancestor))
> > +                     return false;
> > +             if (!ret)
> > +                     ret = true;
> > +     }
> > +     return ret;
>
> I feel ambivalent about adding support for this in cgroup1 especially given
> that this can only work for fd based interface which is worse than the ID
> based ones.

The fd-based cgroup interface plays a crucial role in BPF programs,
particularly in components such as cgroup_iter, bpf_cgrp_storage, and
cgroup_array maps, as well as in the attachment and detachment of
cgroups.

However, as far as I know, bpf_cgrp_storage, cgroup_array, and cgroup
attachment/detachment work only with the cgroup fd-based interface.
Unfortunately, all of these functionalities are limited to cgroup2,
which poses challenges in containerized environments.
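
As a concrete illustration of that fd-based flow, here is a sketch of
how userspace might wire a cgroup1 directory fd into a cgroup_array
slot with libbpf. The path comes from the commit message, the function
name is made up, and error handling is minimal.

#include <fcntl.h>
#include <bpf/bpf.h>

/* map_fd: fd of a BPF_MAP_TYPE_CGROUP_ARRAY map obtained elsewhere. */
static int set_watched_cgroup(int map_fd)
{
	__u32 key = 0;
	int cg_fd;

	cg_fd = open("/sys/fs/cgroup/cpu,cpuacct/kubepods/burstable",
		     O_RDONLY | O_DIRECTORY);
	if (cg_fd < 0)
		return -1;
	/* The kernel resolves the fd to a struct cgroup at update time. */
	return bpf_map_update_elem(map_fd, &key, &cg_fd, BPF_ANY);
}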

To enable seamless BPF integration in our Kubernetes environment, we
have been exploring a transition from cgroup1 to cgroup2. While that
transition is desirable for its future-forward nature, it is complex,
because numerous applications would need to adapt.

We acknowledge that cgroup2 represents the future, but such transitions
take time and effort. Consequently, rather than migrating to cgroup2
now, we are considering modifying the BPF kernel code to ensure
compatibility with cgroup1. It appears these modifications entail only
minor adjustments, which makes this option more palatable.

> Even if we're doing this, the above is definitely not what we
> want to do as it won't work for controller-less hierarchies like the one
> that systemd used to use.

Right. It can't work for /sys/fs/cgroup/systemd/.

> You'd have to lock css_set_lock and walk the
> cgpr_cset_links.

That seems better. Will investigate it.  Thanks for your suggestion.

--
Regards
Yafang
Tejun Heo Sept. 11, 2023, 8:27 p.m. UTC | #4
On Thu, Sep 07, 2023 at 11:05:07AM +0800, Yafang Shao wrote:
> The fd-based cgroup interface plays a crucial role in BPF programs,
> particularly in components such as cgroup_iter, bpf_cgrp_storage, and
> cgroup_array maps, as well as in the attachment and detachment of
> cgroups.

Yeah, I know they're used. It's just that they are inferior identifiers
from cgroup's POV, as they are ephemeral and can't easily be transferred
across process boundaries or persisted beyond the lifetime of the
fd-owning process or the cgroups being pointed to.

Thanks.
Michal Koutný Sept. 18, 2023, 2:45 p.m. UTC | #5
On Sun, Sep 03, 2023 at 02:27:56PM +0000, Yafang Shao <laoar.shao@gmail.com> wrote:
>  static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
>  					       struct cgroup *ancestor)
>  {
>  	struct css_set *cset = task_css_set(task);
> +	struct cgroup *cgrp;
> +	bool ret = false;
> +	int ssid;
> +
> +	if (ancestor->root == &cgrp_dfl_root)
> +		return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
> +
> +	for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++) {

This loop would be better as an iteration over cset->cgrp_links, so it
can handle any v1 hierarchy (under css_set_lock :-/).

> +		if (!ancestor->subsys[ssid])
> +			continue;
>  
> -	return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
> +		cgrp = task_css(task, ssid)->cgroup;

Does this pass on a lockdep-enabled kernel?

See the conditions in task_css_set_check(); it seems at least an RCU
read lock would be needed (if not going through the cgrp_links
mentioned above).

HTH,
Michal
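
For reference, the lockdep conditions being referred to look roughly
like this in include/linux/cgroup.h on CONFIG_PROVE_RCU kernels (check
your tree for the exact form):

#define task_css_set_check(task, __c)				\
	rcu_dereference_check((task)->cgroups,			\
		rcu_read_lock_sched_held() ||			\
		lockdep_is_held(&cgroup_mutex) ||		\
		lockdep_is_held(&css_set_lock) ||		\
		((task)->flags & PF_EXITING) || (__c))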
Yafang Shao Sept. 19, 2023, 5:42 a.m. UTC | #6
On Mon, Sep 18, 2023 at 10:45 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Sun, Sep 03, 2023 at 02:27:56PM +0000, Yafang Shao <laoar.shao@gmail.com> wrote:
> >  static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
> >                                              struct cgroup *ancestor)
> >  {
> >       struct css_set *cset = task_css_set(task);
> > +     struct cgroup *cgrp;
> > +     bool ret = false;
> > +     int ssid;
> > +
> > +     if (ancestor->root == &cgrp_dfl_root)
> > +             return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
> > +
> > +     for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++) {
>
> This loop would be better as an iteration over cset->cgrp_links, so it
> can handle any v1 hierarchy (under css_set_lock :-/).

Agree. That is better.

>
> > +             if (!ancestor->subsys[ssid])
> > +                     continue;
> >
> > -     return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
> > +             cgrp = task_css(task, ssid)->cgroup;
>
> Does this pass on a lockdep-enabled kernel?

Yes, it passes with lockdep enabled.

>
> See the conditions in task_css_set_check(); it seems at least an RCU
> read lock would be needed (if not going through the cgrp_links
> mentioned above).

All of its call sites are already under RCU protection, so we don't
need to take the RCU read lock explicitly here.
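
To ground that claim, the main caller is the
bpf_current_task_under_cgroup() helper, which BPF programs invoke under
rcu_read_lock(); paraphrased from kernel/trace/bpf_trace.c:

BPF_CALL_2(bpf_current_task_under_cgroup, struct bpf_map *, map, u32, idx)
{
	struct bpf_array *array = container_of(map, struct bpf_array, map);
	struct cgroup *cgrp;

	if (unlikely(idx >= array->map.max_entries))
		return -E2BIG;

	cgrp = READ_ONCE(array->ptrs[idx]);
	if (unlikely(!cgrp))
		return -EAGAIN;

	/* BPF program context runs under RCU, which is what satisfies
	 * task_css_set_check() inside task_under_cgroup_hierarchy(). */
	return task_under_cgroup_hierarchy(current, cgrp);
}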

Patch

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b307013..5414a2c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -543,15 +543,33 @@ static inline struct cgroup *cgroup_ancestor(struct cgroup *cgrp,
  * @ancestor: possible ancestor of @task's cgroup
  *
  * Tests whether @task's default cgroup hierarchy is a descendant of @ancestor.
- * It follows all the same rules as cgroup_is_descendant, and only applies
- * to the default hierarchy.
+ * It follows all the same rules as cgroup_is_descendant.
  */
 static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 					       struct cgroup *ancestor)
 {
 	struct css_set *cset = task_css_set(task);
+	struct cgroup *cgrp;
+	bool ret = false;
+	int ssid;
+
+	if (ancestor->root == &cgrp_dfl_root)
+		return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
+
+	for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++) {
+		if (!ancestor->subsys[ssid])
+			continue;
 
-	return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
+		cgrp = task_css(task, ssid)->cgroup;
+		if (!cgrp)
+			continue;
+
+		if (!cgroup_is_descendant(cgrp, ancestor))
+			return false;
+		if (!ret)
+			ret = true;
+	}
+	return ret;
 }
 
 /* no synchronization, the result can only be used as a hint */