diff mbox series

[RFC,3/3] cgroup: Do not take css_set_lock in cgroup_show_path

Message ID 20230502133847.14570-4-mkoutny@suse.com (mailing list archive)
State Mainlined, archived
Headers show
Series Rework locking when rendering mountinfo cgroup paths | expand

Commit Message

Michal Koutný May 2, 2023, 1:38 p.m. UTC
/proc/$pid/mountinfo may accumulate lots of entries (causing frequent
re-reads of whole file) or lots cgroupfs entries alone.
The cgroupfs entries rendered with cgroup_show_path() may increase/be
subject of css_set_lock contention causing further slowdown -- not only
mountinfo rendering but any other css_set_lock user.

We leverage the fact that mountinfo reading happens with namespace_sem
taken and hierarchy roots thus cannot be freed concurrently.

There are three relevant nodes for each cgroupfs entry:

        R ... cgroup hierarchy root
        M ... mount root
        C ... reader's cgroup NS root

mountinfo is supposed to show path from C to M.

Current's css_set (and linked root cgroups) are stable under
namespace_sem, therefore current_cgns_cgroup_from_root() doesn't need
css_set_lock.

When the path is assembled in kernfs_path_from_node(), we know that:
- C kernfs_node is (transitively) pinned via current->nsproxy,
- M kernfs_node is pinned thanks to namespace_sem,
- path C to M is pinned via child->parent references (this holds also
  when C and M are in distinct subtrees).

Streamline mountinfo rendering a bit by relieving css_set_lock and add
careful notes about that.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 kernel/cgroup/cgroup.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

Comments

Waiman Long May 2, 2023, 7:56 p.m. UTC | #1
On 5/2/23 09:38, Michal Koutný wrote:
> /proc/$pid/mountinfo may accumulate lots of entries (causing frequent
> re-reads of whole file) or lots cgroupfs entries alone.
> The cgroupfs entries rendered with cgroup_show_path() may increase/be
> subject of css_set_lock contention causing further slowdown -- not only
> mountinfo rendering but any other css_set_lock user.
>
> We leverage the fact that mountinfo reading happens with namespace_sem
> taken and hierarchy roots thus cannot be freed concurrently.
>
> There are three relevant nodes for each cgroupfs entry:
>
>          R ... cgroup hierarchy root
>          M ... mount root
>          C ... reader's cgroup NS root
>
> mountinfo is supposed to show path from C to M.
>
> Current's css_set (and linked root cgroups) are stable under
> namespace_sem, therefore current_cgns_cgroup_from_root() doesn't need
> css_set_lock.
>
> When the path is assembled in kernfs_path_from_node(), we know that:
> - C kernfs_node is (transitively) pinned via current->nsproxy,
> - M kernfs_node is pinned thanks to namespace_sem,
> - path C to M is pinned via child->parent references (this holds also
>    when C and M are in distinct subtrees).
>
> Streamline mountinfo rendering a bit by relieving css_set_lock and add
> careful notes about that.
>
> Signed-off-by: Michal Koutný <mkoutny@suse.com>
> ---
>   kernel/cgroup/cgroup.c | 12 +++++++-----
>   1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 32d693a797b9..e2ec6f0028be 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1407,12 +1407,18 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
>   	struct cgroup *res_cgroup = NULL;
>   
>   	if (cset == &init_css_set) {
> +		/* callers must ensure root stability */
>   		res_cgroup = &root->cgrp;
>   	} else if (root == &cgrp_dfl_root) {
>   		res_cgroup = cset->dfl_cgrp;
>   	} else {
>   		struct cgrp_cset_link *link;
> -		lockdep_assert_held(&css_set_lock);
> +		/* cset's cgroups are pinned unless they are root cgroups that
> +		 * were unmounted.  We look at links to !cgrp_dfl_root
> +		 * cgroup_root, either lock ensures the list is not mutated
> +		 */
> +		lockdep_assert(lockdep_is_held(&css_set_lock) ||
> +			       lockdep_is_held_type(&namespace_sem, -1));
Again lockdep_is_held(&namespace_sem) is good enough.

>   
>   		list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
>   			struct cgroup *c = link->cgrp;
> @@ -1438,8 +1444,6 @@ current_cgns_cgroup_from_root(struct cgroup_root *root)
>   	struct cgroup *res = NULL;
>   	struct css_set *cset;
>   
> -	lockdep_assert_held(&css_set_lock);
> -
>   	/* namespace_sem ensures `root` stability on unmount */
>   	lockdep_assert(lockdep_is_held_type(&namespace_sem, -1));
>   
> @@ -1905,10 +1909,8 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
>   	if (!buf)
>   		return -ENOMEM;
>   
> -	spin_lock_irq(&css_set_lock);
>   	ns_cgroup = current_cgns_cgroup_from_root(kf_cgroot);
>   	len = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, PATH_MAX);
> -	spin_unlock_irq(&css_set_lock);
>   
>   	if (len >= PATH_MAX)
>   		len = -ERANGE;
Cheers,
Longman
Tejun Heo May 5, 2023, 3:45 p.m. UTC | #2
Hello,

On Tue, May 02, 2023 at 03:38:47PM +0200, Michal Koutný wrote:
> /proc/$pid/mountinfo may accumulate lots of entries (causing frequent
> re-reads of whole file) or lots cgroupfs entries alone.
> The cgroupfs entries rendered with cgroup_show_path() may increase/be
> subject of css_set_lock contention causing further slowdown -- not only
> mountinfo rendering but any other css_set_lock user.
> 
> We leverage the fact that mountinfo reading happens with namespace_sem
> taken and hierarchy roots thus cannot be freed concurrently.
> 
> There are three relevant nodes for each cgroupfs entry:
> 
>         R ... cgroup hierarchy root
>         M ... mount root
>         C ... reader's cgroup NS root
> 
> mountinfo is supposed to show path from C to M.

At least for cgroup2, the path from C to M isn't gonna change once NS is
established, right? Nothing can be moved or renamed while the NS root is
there. If so, can't we just cache the formatted path and return the same
thing without any locking? The proposed changes seem a bit too brittle to
me.

Thanks.
Michal Koutný May 5, 2023, 5:32 p.m. UTC | #3
On Fri, May 05, 2023 at 05:45:58AM -1000, Tejun Heo <tj@kernel.org> wrote:
> > There are three relevant nodes for each cgroupfs entry:
> > 
> >         R ... cgroup hierarchy root
> >         M ... mount root
> >         C ... reader's cgroup NS root
> > 
> > mountinfo is supposed to show path from C to M.
> 
> At least for cgroup2, the path from C to M isn't gonna change once NS is
> established, right?

Right. Although, the argument about M (when C above M or when C and M in
different subtrees) implicitly relies on the namespace_sem.

> Nothing can be moved or renamed while the NS root is there. If so,
> can't we just cache the formatted path and return the same thing
> without any locking?

Here I find the caching complexity not worth it for v2 only (also C is
in the eye of the beholder) -- because then css_set_lock can be dropped
from cgroup_show_path with simpler reasoning.

(Sadly, the bigger benefit would be on hybrid setups multiplied by the
number of v1 hierarchies listed.)

(OTOH, the caching argument and weights for it may be different for
/proc/$pid/cgroup.)

> The proposed changes seem a bit too brittle to me.

OK, I'll look into separate cgroup_show_path and cgroup1_show_path
approach.

Thanks,
Michal
Tejun Heo May 5, 2023, 6:17 p.m. UTC | #4
Hello,

On Fri, May 05, 2023 at 07:32:40PM +0200, Michal Koutný wrote:
> On Fri, May 05, 2023 at 05:45:58AM -1000, Tejun Heo <tj@kernel.org> wrote:
> > > There are three relevant nodes for each cgroupfs entry:
> > > 
> > >         R ... cgroup hierarchy root
> > >         M ... mount root
> > >         C ... reader's cgroup NS root
> > > 
> > > mountinfo is supposed to show path from C to M.
> > 
> > At least for cgroup2, the path from C to M isn't gonna change once NS is
> > established, right?
> 
> Right. Although, the argument about M (when C above M or when C and M in
> different subtrees) implicitly relies on the namespace_sem.

I don't follow. Can you please elaborate a bit more?

Thanks.
Michal Koutný May 9, 2023, 10:34 a.m. UTC | #5
On Fri, May 05, 2023 at 08:17:10AM -1000, Tejun Heo <tj@kernel.org> wrote:
> On Fri, May 05, 2023 at 07:32:40PM +0200, Michal Koutný wrote:
> > On Fri, May 05, 2023 at 05:45:58AM -1000, Tejun Heo <tj@kernel.org> wrote:
> > > > There are three relevant nodes for each cgroupfs entry:
> > > > 
> > > >         R ... cgroup hierarchy root
> > > >         M ... mount root
> > > >         C ... reader's cgroup NS root
> > > > 
> > > > mountinfo is supposed to show path from C to M.
> > > 
> > > At least for cgroup2, the path from C to M isn't gonna change once NS is
> > > established, right?
> > 
> > Right. Although, the argument about M (when C above M or when C and M in
> > different subtrees) implicitly relies on the namespace_sem.
> 
> I don't follow. Can you please elaborate a bit more?

I wanted to say that even with restriction to cgroup2, the css_set_lock
removal would also rely on namespace_sem.

For a given mountinfo entry the path C--M won't change (no renames).
The question is whether cgroup M will stay around (with the relaxed
locking):

  - C >= M (C is below M) 
    -> C (transitively) pins M

  - C < M (C is above M) or C and M are in two disjoint subtrees (path
    goes through a common ancestor)
    -> M could be released without relation to C (even on cgroup2, with
       the css_set_lock removed) but such a destructive operation on M
       is excluded as long as namespace_sem is held during entry
       rendering.

Does that clarify the trade-off of removing css_set_lock at this spot?

Thanks,
Michal
Tejun Heo May 22, 2023, 8:55 p.m. UTC | #6
Hello, Michal.

Sorry about the delay.

On Tue, May 09, 2023 at 12:34:53PM +0200, Michal Koutný wrote:
> On Fri, May 05, 2023 at 08:17:10AM -1000, Tejun Heo <tj@kernel.org> wrote:
> > On Fri, May 05, 2023 at 07:32:40PM +0200, Michal Koutný wrote:
> > > On Fri, May 05, 2023 at 05:45:58AM -1000, Tejun Heo <tj@kernel.org> wrote:
> > > > > There are three relevant nodes for each cgroupfs entry:
> > > > > 
> > > > >         R ... cgroup hierarchy root
> > > > >         M ... mount root
> > > > >         C ... reader's cgroup NS root
> > > > > 
> > > > > mountinfo is supposed to show path from C to M.
> > > > 
> > > > At least for cgroup2, the path from C to M isn't gonna change once NS is
> > > > established, right?
> > > 
> > > Right. Although, the argument about M (when C above M or when C and M in
> > > different subtrees) implicitly relies on the namespace_sem.
> > 
> > I don't follow. Can you please elaborate a bit more?
> 
> I wanted to say that even with restriction to cgroup2, the css_set_lock
> removal would also rely on namespace_sem.
> 
> For a given mountinfo entry the path C--M won't change (no renames).
> The question is whether cgroup M will stay around (with the relaxed
> locking):
> 
>   - C >= M (C is below M) 
>     -> C (transitively) pins M

Yeah, this was what I was thinking.

>   - C < M (C is above M) or C and M are in two disjoint subtrees (path
>     goes through a common ancestor)
>     -> M could be released without relation to C (even on cgroup2, with
>        the css_set_lock removed) but such a destructive operation on M
>        is excluded as long as namespace_sem is held during entry
>        rendering.
> 
> Does that clarify the trade-off of removing css_set_lock at this spot?

Right, you can have cgroup outside NS root still mounted and that mount root
can be viewed from multiple cgroup NS's, so the the path isn't fixed either.

Having enough lockdep annotations should do but if reasonable the preference
is being a bit more self-contained.

Thanks for the explanation.
diff mbox series

Patch

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 32d693a797b9..e2ec6f0028be 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1407,12 +1407,18 @@  static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
 	struct cgroup *res_cgroup = NULL;
 
 	if (cset == &init_css_set) {
+		/* callers must ensure root stability */
 		res_cgroup = &root->cgrp;
 	} else if (root == &cgrp_dfl_root) {
 		res_cgroup = cset->dfl_cgrp;
 	} else {
 		struct cgrp_cset_link *link;
-		lockdep_assert_held(&css_set_lock);
+		/* cset's cgroups are pinned unless they are root cgroups that
+		 * were unmounted.  We look at links to !cgrp_dfl_root
+		 * cgroup_root, either lock ensures the list is not mutated
+		 */
+		lockdep_assert(lockdep_is_held(&css_set_lock) ||
+			       lockdep_is_held_type(&namespace_sem, -1));
 
 		list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
 			struct cgroup *c = link->cgrp;
@@ -1438,8 +1444,6 @@  current_cgns_cgroup_from_root(struct cgroup_root *root)
 	struct cgroup *res = NULL;
 	struct css_set *cset;
 
-	lockdep_assert_held(&css_set_lock);
-
 	/* namespace_sem ensures `root` stability on unmount */
 	lockdep_assert(lockdep_is_held_type(&namespace_sem, -1));
 
@@ -1905,10 +1909,8 @@  int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
 	if (!buf)
 		return -ENOMEM;
 
-	spin_lock_irq(&css_set_lock);
 	ns_cgroup = current_cgns_cgroup_from_root(kf_cgroot);
 	len = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, PATH_MAX);
-	spin_unlock_irq(&css_set_lock);
 
 	if (len >= PATH_MAX)
 		len = -ERANGE;