
[v2] cgroup: Kill the parent controller when its last child is killed

Message ID 20220404142535.145975-1-minhquangbui99@gmail.com (mailing list archive)
State Not Applicable


Commit Message

Bui Quang Minh April 4, 2022, 2:25 p.m. UTC
When umounting a cgroup controller, if the controller has no children,
the initial ref is dropped in cgroup_kill_sb. In the cgroup_rmdir path,
the controller is deleted from its parent's children list in
css_release_work_fn, which runs deferred on a kernel workqueue.

With this simple script

	#!/bin/sh

	mount -t cgroup -o none,name=test test ./tmp
	mkdir -p ./tmp/abc

	rmdir ./tmp/abc
	umount ./tmp

	sleep 5
	cat /proc/self/cgroup

The rmdir removes the last child, and the umount is then expected to kill
the parent controller. However, when running the above script, we may get

	1:name=test:/

This shows that the parent controller has not been killed. The reason is
that after rmdir completes, the parent's children list is not guaranteed
to be empty yet, because css_release_work_fn is deferred to a workqueue.
If cgroup_kill_sb runs before that work, it does not drop the initial
ref. Later, the worker only removes the child from the list; it never
checks whether the list has become empty so that the parent controller
can be killed. As a result, the parent controller still holds the initial
ref but has no logical refs left (children ref, mount ref).
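
Concretely, the problematic ordering looks like this (a sketch;
css_release_work_fn is queued on cgroup_destroy_wq):

	rmdir ./tmp/abc                  umount ./tmp
	---------------                  ------------
	cgroup_rmdir()
	  kills the child's refcnt,
	  css_release_work_fn is queued
	                                 cgroup_kill_sb()
	                                   children list still non-empty
	                                   -> initial ref NOT dropped
	css_release_work_fn()
	  list_del_rcu(&css->sibling)
	  (no check that the parent is
	   now childless)

	=> the root keeps its initial ref forever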

This commit adds a path to the worker function that kills the parent
controller when its last child is removed.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
---
v2: Fix compilation error when CONFIG_CGROUP_BPF is not set

 kernel/cgroup/cgroup.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)


base-commit: 1be9b7206b7dbff54b223eee7ef3bc91b80433aa

Comments

Tejun Heo April 4, 2022, 5:37 p.m. UTC | #1
Hello,

On Mon, Apr 04, 2022 at 09:25:34PM +0700, Bui Quang Minh wrote:
> When umounting a cgroup controller, if the controller has no children,
> the initial ref is dropped in cgroup_kill_sb. In the cgroup_rmdir path,
> the controller is deleted from its parent's children list in
> css_release_work_fn, which runs deferred on a kernel workqueue.
> 
> With this simple script
> 
> 	#!/bin/sh
> 
> 	mount -t cgroup -o none,name=test test ./tmp
> 	mkdir -p ./tmp/abc
> 
> 	rmdir ./tmp/abc
> 	umount ./tmp
> 
> 	sleep 5
> 	cat /proc/self/cgroup
> 
> The rmdir removes the last child, and the umount is then expected to kill
> the parent controller. However, when running the above script, we may get
> 
> 	1:name=test:/

First of all, remounting after active use isn't a well-supported use case as
documented in the admin doc. The problem is that there's no finite time
horizon around when all the references are gonna go away - some references
may be held in cache which may not be released unless certain conditions are
met. So, while changing hierarchy configuration is useful for system setup,
development and debugging, for production use, it's a boot time
configuration mechanism.

> This shows that the parent controller has not been killed. The reason is
> that after rmdir completes, the parent's children list is not guaranteed
> to be empty yet, because css_release_work_fn is deferred to a workqueue.
> If cgroup_kill_sb runs before that work, it does not drop the initial
> ref. Later, the worker only removes the child from the list; it never
> checks whether the list has become empty so that the parent controller
> can be killed. As a result, the parent controller still holds the initial
> ref but has no logical refs left (children ref, mount ref).
> 
> This commit adds a path to the worker function that kills the parent
> controller when its last child is removed.

And the suggested behavior doesn't make much sense to me. It doesn't
actually solve the underlying problem but instead always makes css
destructions recursive, which can lead to surprises for normal use cases.

Thanks.
Michal Koutný April 5, 2022, 9:11 a.m. UTC | #2
On Mon, Apr 04, 2022 at 07:37:24AM -1000, Tejun Heo <tj@kernel.org> wrote:
> And the suggested behavior doesn't make much sense to me. It doesn't
> actually solve the underlying problem but instead always makes css
> destructions recursive, which can lead to surprises for normal use cases.

I also don't like the nested special-case use of percpu_ref_kill().

I looked at this and the solution I had in mind turned out to be a revert
of commit 3c606d35fe97 ("cgroup: prevent mount hang due to memory
controller lifetime"). So at unmount time it's necessary to distinguish
children that are in the process of removal from children that are online
or pinned indefinitely.

What about:

--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2205,11 +2205,14 @@ static void cgroup_kill_sb(struct super_block *sb)
        struct cgroup_root *root = cgroup_root_from_kf(kf_root);

        /*
-        * If @root doesn't have any children, start killing it.
+        * If @root doesn't have any children held by residual state (e.g.
+        * memory controller), start killing it, flush workqueue to filter out
+        * transiently offlined children.
         * This prevents new mounts by disabling percpu_ref_tryget_live().
         *
         * And don't kill the default root.
         */
+       flush_workqueue(cgroup_destroy_wq);
        if (list_empty(&root->cgrp.self.children) && root != &cgrp_dfl_root &&
            !percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
                cgroup_bpf_offline(&root->cgrp);

(I suspect a race is technically still possible between a concurrent
unmount and the last rmdir, but the flush on the kill_sb path should be
affordable and it prevents unnecessarily conserved cgroup roots.)
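
Roughly, the window I mean (a sketch, assuming the flush is placed as in
the diff above):

	umount                               rmdir (last child)
	------                               ------------------
	flush_workqueue(cgroup_destroy_wq)
	                                     cgroup_rmdir()
	                                       css_release_work_fn queued
	list_empty(&root->cgrp.self.children)
	  -> false, root is not killed
	                                     css_release_work_fn()
	                                       list_del_rcu(&css->sibling)

so a removal racing in after the flush but before the list_empty() check
would still leave the root conserved.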

Michal
Bui Quang Minh April 5, 2022, 2:58 p.m. UTC | #3
On 4/5/22 16:11, Michal Koutný wrote:
> On Mon, Apr 04, 2022 at 07:37:24AM -1000, Tejun Heo <tj@kernel.org> wrote:
>> And the suggested behavior doesn't make much sense to me. It doesn't
>> actually solve the underlying problem but instead always makes css
>> destructions recursive, which can lead to surprises for normal use cases.
> 
> I also don't like the nested special-case use percpu_ref_kill().

After thinking more carefully, I agree with your points. The recursive 
css destruction not only fails to fix up the previous parents' metadata 
correctly, it is also not a desirable behavior.

> I looked at this and the solution I had in mind turned out to be a revert
> of commit 3c606d35fe97 ("cgroup: prevent mount hang due to memory
> controller lifetime"). So at unmount time it's necessary to distinguish
> children that are in the process of removal from children that are online
> or pinned indefinitely.
> 
> What about:
> 
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2205,11 +2205,14 @@ static void cgroup_kill_sb(struct super_block *sb)
>          struct cgroup_root *root = cgroup_root_from_kf(kf_root);
> 
>          /*
> -        * If @root doesn't have any children, start killing it.
> +        * If @root doesn't have any children held by residual state (e.g.
> +        * memory controller), start killing it, flush workqueue to filter out
> +        * transiently offlined children.
>           * This prevents new mounts by disabling percpu_ref_tryget_live().
>           *
>           * And don't kill the default root.
>           */
> +       flush_workqueue(cgroup_destroy_wq);
>          if (list_empty(&root->cgrp.self.children) && root != &cgrp_dfl_root &&
>              !percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
>                  cgroup_bpf_offline(&root->cgrp);
> 
> (I suspect a race is technically still possible between a concurrent
> unmount and the last rmdir, but the flush on the kill_sb path should be
> affordable and it prevents unnecessarily conserved cgroup roots.)

Your proposed solution looks good to me. As in my example, the flush 
guarantees that the rmdir and its deferred work have been executed before 
the cleanup in the umount path.

But what do you think about

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index f01ff231a484..5578ee76e789 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2215,6 +2215,7 @@ static void cgroup_kill_sb(struct super_block *sb)
                 cgroup_bpf_offline(&root->cgrp);
                 percpu_ref_kill(&root->cgrp.self.refcnt);
         }
+       root->cgrp.flags |= CGRP_UMOUNT;
         cgroup_put(&root->cgrp);
         kernfs_kill_sb(sb);
  }
@@ -5152,12 +5153,28 @@ static void css_release_work_fn(struct work_struct *work)
                 container_of(work, struct cgroup_subsys_state, destroy_work);
         struct cgroup_subsys *ss = css->ss;
         struct cgroup *cgrp = css->cgroup;
+       struct cgroup *parent = cgroup_parent(cgrp);

         mutex_lock(&cgroup_mutex);

         css->flags |= CSS_RELEASED;
         list_del_rcu(&css->sibling);

+       /*
+        * If parent doesn't have any children, start killing it.
+        * And don't kill the default root.
+        */
+       if (parent && list_empty(&parent->self.children) &&
+           parent->flags & CGRP_UMOUNT &&
+           parent != &cgrp_dfl_root.cgrp &&
+           !percpu_ref_is_dying(&parent->self.refcnt)) {
+#ifdef CONFIG_CGROUP_BPF
+               if (!percpu_ref_is_dying(&cgrp->bpf.refcnt))
+                       cgroup_bpf_offline(parent);
+#endif
+               percpu_ref_kill(&parent->self.refcnt);
+       }
+
         if (ss) {
                 /* css release path */
                 if (!list_empty(&css->rstat_css_node)) {

The idea is to set a flag in the umount path; the rmdir path then only 
destroys the parent css when that parent is being umounted, so nothing 
recursive here. This is just an incomplete example; we may need to reset 
that flag when remounting.
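
For reference, CGRP_UMOUNT doesn't exist in the tree yet; a hypothetical
sketch of the missing definition (note the snippet above tests it as a
bit mask, while the existing cgroup flags in include/linux/cgroup-defs.h
are bit numbers used with test_bit()/set_bit(), so one of the two sides
would have to be adjusted):

	/* hypothetical addition to the cgroup->flags enum in
	 * include/linux/cgroup-defs.h
	 */
	enum {
		CGRP_NOTIFY_ON_RELEASE,
		/* ... existing flags ... */
		CGRP_UMOUNT,	/* the root's sb is being killed */
	};

and the remount path would have to clear it again.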

Thanks,
Quang Minh.
Tejun Heo April 21, 2022, 10:37 p.m. UTC | #4
On Tue, Apr 05, 2022 at 09:58:01PM +0700, Bui Quang Minh wrote:
> @@ -5152,12 +5153,28 @@ static void css_release_work_fn(struct work_struct *work)
>                 container_of(work, struct cgroup_subsys_state, destroy_work);
>         struct cgroup_subsys *ss = css->ss;
>         struct cgroup *cgrp = css->cgroup;
> +       struct cgroup *parent = cgroup_parent(cgrp);
> 
>         mutex_lock(&cgroup_mutex);
> 
>         css->flags |= CSS_RELEASED;
>         list_del_rcu(&css->sibling);
> 
> +       /*
> +        * If parent doesn't have any children, start killing it.
> +        * And don't kill the default root.
> +        */
> +       if (parent && list_empty(&parent->self.children) &&
> +           parent->flags & CGRP_UMOUNT &&
> +           parent != &cgrp_dfl_root.cgrp &&
> +           !percpu_ref_is_dying(&parent->self.refcnt)) {
> +#ifdef CONFIG_CGROUP_BPF
> +               if (!percpu_ref_is_dying(&cgrp->bpf.refcnt))
> +                       cgroup_bpf_offline(parent);
> +#endif
> +               percpu_ref_kill(&parent->self.refcnt);
> +       }
> +
>         if (ss) {
>                 /* css release path */
>                 if (!list_empty(&css->rstat_css_node)) {
> 
> The idea is to set a flag in the umount path; the rmdir path then only
> destroys the parent css when that parent is being umounted, so nothing
> recursive here. This is just an incomplete example; we may need to reset
> that flag when remounting.

I'm generally against adding complexity for this given that it's never
gonna be actually reliable. If adding a one-liner flush_workqueue() makes
life easier in some cases, why not? But the root cause is something that
can't be solved by messing with the release / umount paths and something
we decided against supporting.

Thanks.

Patch

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index f01ff231a484..1916070f0d59 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5152,12 +5152,27 @@  static void css_release_work_fn(struct work_struct *work)
 		container_of(work, struct cgroup_subsys_state, destroy_work);
 	struct cgroup_subsys *ss = css->ss;
 	struct cgroup *cgrp = css->cgroup;
+	struct cgroup *parent = cgroup_parent(cgrp);
 
 	mutex_lock(&cgroup_mutex);
 
 	css->flags |= CSS_RELEASED;
 	list_del_rcu(&css->sibling);
 
+	/*
+	 * If parent doesn't have any children, start killing it.
+	 * And don't kill the default root.
+	 */
+	if (parent && list_empty(&parent->self.children) &&
+	    parent != &cgrp_dfl_root.cgrp &&
+	    !percpu_ref_is_dying(&parent->self.refcnt)) {
+#ifdef CONFIG_CGROUP_BPF
+		if (!percpu_ref_is_dying(&cgrp->bpf.refcnt))
+			cgroup_bpf_offline(parent);
+#endif
+		percpu_ref_kill(&parent->self.refcnt);
+	}
+
 	if (ss) {
 		/* css release path */
 		if (!list_empty(&css->rstat_css_node)) {