Message ID | 20230502133847.14570-4-mkoutny@suse.com (mailing list archive) |
---|---|
State | Mainlined, archived |
Headers | show |
Series | Rework locking when rendering mountinfo cgroup paths | expand |
On 5/2/23 09:38, Michal Koutný wrote: > /proc/$pid/mountinfo may accumulate lots of entries (causing frequent > re-reads of whole file) or lots cgroupfs entries alone. > The cgroupfs entries rendered with cgroup_show_path() may increase/be > subject of css_set_lock contention causing further slowdown -- not only > mountinfo rendering but any other css_set_lock user. > > We leverage the fact that mountinfo reading happens with namespace_sem > taken and hierarchy roots thus cannot be freed concurrently. > > There are three relevant nodes for each cgroupfs entry: > > R ... cgroup hierarchy root > M ... mount root > C ... reader's cgroup NS root > > mountinfo is supposed to show path from C to M. > > Current's css_set (and linked root cgroups) are stable under > namespace_sem, therefore current_cgns_cgroup_from_root() doesn't need > css_set_lock. > > When the path is assembled in kernfs_path_from_node(), we know that: > - C kernfs_node is (transitively) pinned via current->nsproxy, > - M kernfs_node is pinned thanks to namespace_sem, > - path C to M is pinned via child->parent references (this holds also > when C and M are in distinct subtrees). > > Streamline mountinfo rendering a bit by relieving css_set_lock and add > careful notes about that. > > Signed-off-by: Michal Koutný <mkoutny@suse.com> > --- > kernel/cgroup/cgroup.c | 12 +++++++----- > 1 file changed, 7 insertions(+), 5 deletions(-) > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index 32d693a797b9..e2ec6f0028be 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -1407,12 +1407,18 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset, > struct cgroup *res_cgroup = NULL; > > if (cset == &init_css_set) { > + /* callers must ensure root stability */ > res_cgroup = &root->cgrp; > } else if (root == &cgrp_dfl_root) { > res_cgroup = cset->dfl_cgrp; > } else { > struct cgrp_cset_link *link; > - lockdep_assert_held(&css_set_lock); > + /* cset's cgroups are pinned unless they are root cgroups that > + * were unmounted. We look at links to !cgrp_dfl_root > + * cgroup_root, either lock ensures the list is not mutated > + */ > + lockdep_assert(lockdep_is_held(&css_set_lock) || > + lockdep_is_held_type(&namespace_sem, -1)); Again lockdep_is_held(&namespace_sem) is good enough. > > list_for_each_entry(link, &cset->cgrp_links, cgrp_link) { > struct cgroup *c = link->cgrp; > @@ -1438,8 +1444,6 @@ current_cgns_cgroup_from_root(struct cgroup_root *root) > struct cgroup *res = NULL; > struct css_set *cset; > > - lockdep_assert_held(&css_set_lock); > - > /* namespace_sem ensures `root` stability on unmount */ > lockdep_assert(lockdep_is_held_type(&namespace_sem, -1)); > > @@ -1905,10 +1909,8 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node, > if (!buf) > return -ENOMEM; > > - spin_lock_irq(&css_set_lock); > ns_cgroup = current_cgns_cgroup_from_root(kf_cgroot); > len = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, PATH_MAX); > - spin_unlock_irq(&css_set_lock); > > if (len >= PATH_MAX) > len = -ERANGE; Cheers, Longman
Hello, On Tue, May 02, 2023 at 03:38:47PM +0200, Michal Koutný wrote: > /proc/$pid/mountinfo may accumulate lots of entries (causing frequent > re-reads of whole file) or lots cgroupfs entries alone. > The cgroupfs entries rendered with cgroup_show_path() may increase/be > subject of css_set_lock contention causing further slowdown -- not only > mountinfo rendering but any other css_set_lock user. > > We leverage the fact that mountinfo reading happens with namespace_sem > taken and hierarchy roots thus cannot be freed concurrently. > > There are three relevant nodes for each cgroupfs entry: > > R ... cgroup hierarchy root > M ... mount root > C ... reader's cgroup NS root > > mountinfo is supposed to show path from C to M. At least for cgroup2, the path from C to M isn't gonna change once NS is established, right? Nothing can be moved or renamed while the NS root is there. If so, can't we just cache the formatted path and return the same thing without any locking? The proposed changes seem a bit too brittle to me. Thanks.
On Fri, May 05, 2023 at 05:45:58AM -1000, Tejun Heo <tj@kernel.org> wrote: > > There are three relevant nodes for each cgroupfs entry: > > > > R ... cgroup hierarchy root > > M ... mount root > > C ... reader's cgroup NS root > > > > mountinfo is supposed to show path from C to M. > > At least for cgroup2, the path from C to M isn't gonna change once NS is > established, right? Right. Although, the argument about M (when C above M or when C and M in different subtrees) implicitly relies on the namespace_sem. > Nothing can be moved or renamed while the NS root is there. If so, > can't we just cache the formatted path and return the same thing > without any locking? Here I find the caching complexity not worth it for v2 only (also C is in the eye of the beholder) -- because then css_set_lock can be dropped from cgroup_show_path with simpler reasoning. (Sadly, the bigger benefit would be on hybrid setups multiplied by the number of v1 hierarchies listed.) (OTOH, the caching argument and weights for it may be different for /proc/$pid/cgroup.) > The proposed changes seem a bit too brittle to me. OK, I'll look into separate cgroup_show_path and cgroup1_show_path approach. Thanks, Michal
Hello, On Fri, May 05, 2023 at 07:32:40PM +0200, Michal Koutný wrote: > On Fri, May 05, 2023 at 05:45:58AM -1000, Tejun Heo <tj@kernel.org> wrote: > > > There are three relevant nodes for each cgroupfs entry: > > > > > > R ... cgroup hierarchy root > > > M ... mount root > > > C ... reader's cgroup NS root > > > > > > mountinfo is supposed to show path from C to M. > > > > At least for cgroup2, the path from C to M isn't gonna change once NS is > > established, right? > > Right. Although, the argument about M (when C above M or when C and M in > different subtrees) implicitly relies on the namespace_sem. I don't follow. Can you please elaborate a bit more? Thanks.
On Fri, May 05, 2023 at 08:17:10AM -1000, Tejun Heo <tj@kernel.org> wrote: > On Fri, May 05, 2023 at 07:32:40PM +0200, Michal Koutný wrote: > > On Fri, May 05, 2023 at 05:45:58AM -1000, Tejun Heo <tj@kernel.org> wrote: > > > > There are three relevant nodes for each cgroupfs entry: > > > > > > > > R ... cgroup hierarchy root > > > > M ... mount root > > > > C ... reader's cgroup NS root > > > > > > > > mountinfo is supposed to show path from C to M. > > > > > > At least for cgroup2, the path from C to M isn't gonna change once NS is > > > established, right? > > > > Right. Although, the argument about M (when C above M or when C and M in > > different subtrees) implicitly relies on the namespace_sem. > > I don't follow. Can you please elaborate a bit more? I wanted to say that even with restriction to cgroup2, the css_set_lock removal would also rely on namespace_sem. For a given mountinfo entry the path C--M won't change (no renames). The question is whether cgroup M will stay around (with the relaxed locking): - C >= M (C is below M) -> C (transitively) pins M - C < M (C is above M) or C and M are in two disjoint subtrees (path goes through a common ancestor) -> M could be released without relation to C (even on cgroup2, with the css_set_lock removed) but such a destructive operation on M is excluded as long as namespace_sem is held during entry rendering. Does that clarify the trade-off of removing css_set_lock at this spot? Thanks, Michal
Hello, Michal. Sorry about the delay. On Tue, May 09, 2023 at 12:34:53PM +0200, Michal Koutný wrote: > On Fri, May 05, 2023 at 08:17:10AM -1000, Tejun Heo <tj@kernel.org> wrote: > > On Fri, May 05, 2023 at 07:32:40PM +0200, Michal Koutný wrote: > > > On Fri, May 05, 2023 at 05:45:58AM -1000, Tejun Heo <tj@kernel.org> wrote: > > > > > There are three relevant nodes for each cgroupfs entry: > > > > > > > > > > R ... cgroup hierarchy root > > > > > M ... mount root > > > > > C ... reader's cgroup NS root > > > > > > > > > > mountinfo is supposed to show path from C to M. > > > > > > > > At least for cgroup2, the path from C to M isn't gonna change once NS is > > > > established, right? > > > > > > Right. Although, the argument about M (when C above M or when C and M in > > > different subtrees) implicitly relies on the namespace_sem. > > > > I don't follow. Can you please elaborate a bit more? > > I wanted to say that even with restriction to cgroup2, the css_set_lock > removal would also rely on namespace_sem. > > For a given mountinfo entry the path C--M won't change (no renames). > The question is whether cgroup M will stay around (with the relaxed > locking): > > - C >= M (C is below M) > -> C (transitively) pins M Yeah, this was what I was thinking. > - C < M (C is above M) or C and M are in two disjoint subtrees (path > goes through a common ancestor) > -> M could be released without relation to C (even on cgroup2, with > the css_set_lock removed) but such a destructive operation on M > is excluded as long as namespace_sem is held during entry > rendering. > > Does that clarify the trade-off of removing css_set_lock at this spot? Right, you can have cgroup outside NS root still mounted and that mount root can be viewed from multiple cgroup NS's, so the the path isn't fixed either. Having enough lockdep annotations should do but if reasonable the preference is being a bit more self-contained. Thanks for the explanation.
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 32d693a797b9..e2ec6f0028be 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1407,12 +1407,18 @@ static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset, struct cgroup *res_cgroup = NULL; if (cset == &init_css_set) { + /* callers must ensure root stability */ res_cgroup = &root->cgrp; } else if (root == &cgrp_dfl_root) { res_cgroup = cset->dfl_cgrp; } else { struct cgrp_cset_link *link; - lockdep_assert_held(&css_set_lock); + /* cset's cgroups are pinned unless they are root cgroups that + * were unmounted. We look at links to !cgrp_dfl_root + * cgroup_root, either lock ensures the list is not mutated + */ + lockdep_assert(lockdep_is_held(&css_set_lock) || + lockdep_is_held_type(&namespace_sem, -1)); list_for_each_entry(link, &cset->cgrp_links, cgrp_link) { struct cgroup *c = link->cgrp; @@ -1438,8 +1444,6 @@ current_cgns_cgroup_from_root(struct cgroup_root *root) struct cgroup *res = NULL; struct css_set *cset; - lockdep_assert_held(&css_set_lock); - /* namespace_sem ensures `root` stability on unmount */ lockdep_assert(lockdep_is_held_type(&namespace_sem, -1)); @@ -1905,10 +1909,8 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node, if (!buf) return -ENOMEM; - spin_lock_irq(&css_set_lock); ns_cgroup = current_cgns_cgroup_from_root(kf_cgroot); len = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, PATH_MAX); - spin_unlock_irq(&css_set_lock); if (len >= PATH_MAX) len = -ERANGE;
/proc/$pid/mountinfo may accumulate lots of entries (causing frequent re-reads of whole file) or lots cgroupfs entries alone. The cgroupfs entries rendered with cgroup_show_path() may increase/be subject of css_set_lock contention causing further slowdown -- not only mountinfo rendering but any other css_set_lock user. We leverage the fact that mountinfo reading happens with namespace_sem taken and hierarchy roots thus cannot be freed concurrently. There are three relevant nodes for each cgroupfs entry: R ... cgroup hierarchy root M ... mount root C ... reader's cgroup NS root mountinfo is supposed to show path from C to M. Current's css_set (and linked root cgroups) are stable under namespace_sem, therefore current_cgns_cgroup_from_root() doesn't need css_set_lock. When the path is assembled in kernfs_path_from_node(), we know that: - C kernfs_node is (transitively) pinned via current->nsproxy, - M kernfs_node is pinned thanks to namespace_sem, - path C to M is pinned via child->parent references (this holds also when C and M are in distinct subtrees). Streamline mountinfo rendering a bit by relieving css_set_lock and add careful notes about that. Signed-off-by: Michal Koutný <mkoutny@suse.com> --- kernel/cgroup/cgroup.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)