diff mbox series

[mm-unstable,v1,1/4] mm: don't hold css->refcnt during traversal

Message ID 20240724190214.1108049-2-kinseyho@google.com (mailing list archive)
State New
Headers show
Series Improve mem_cgroup_iter() | expand

Commit Message

Kinsey Ho July 24, 2024, 7:02 p.m. UTC
To obtain the pointer to the saved memcg position, mem_cgroup_iter()
currently holds css->refcnt during memcg traversal only to put
css->refcnt at the end of the routine. This isn't necessary as an
rcu_read_lock is already held throughout the function.

Remove css->refcnt usage during traversal by leveraging RCU.

Signed-off-by: Kinsey Ho <kinseyho@google.com>
---
 include/linux/memcontrol.h |  2 +-
 mm/memcontrol.c            | 18 +-----------------
 2 files changed, 2 insertions(+), 18 deletions(-)

Comments

kernel test robot July 25, 2024, 4:33 p.m. UTC | #1
Hi Kinsey,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.10 next-20240725]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kinsey-Ho/mm-don-t-hold-css-refcnt-during-traversal/20240725-030750
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240724190214.1108049-2-kinseyho%40google.com
patch subject: [PATCH mm-unstable v1 1/4] mm: don't hold css->refcnt during traversal
config: x86_64-randconfig-121-20240725 (https://download.01.org/0day-ci/archive/20240726/202407260047.sIuRFLJE-lkp@intel.com/config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240726/202407260047.sIuRFLJE-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407260047.sIuRFLJE-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> mm/memcontrol.c:1063:23: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct mem_cgroup [noderef] __rcu *__old @@     got struct mem_cgroup *[assigned] pos @@
   mm/memcontrol.c:1063:23: sparse:     expected struct mem_cgroup [noderef] __rcu *__old
   mm/memcontrol.c:1063:23: sparse:     got struct mem_cgroup *[assigned] pos
   mm/memcontrol.c:1063:23: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct mem_cgroup [noderef] __rcu *__new @@     got struct mem_cgroup *[assigned] memcg @@
   mm/memcontrol.c:1063:23: sparse:     expected struct mem_cgroup [noderef] __rcu *__new
   mm/memcontrol.c:1063:23: sparse:     got struct mem_cgroup *[assigned] memcg
>> mm/memcontrol.c:1101:17: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct mem_cgroup [noderef] __rcu *__old @@     got struct mem_cgroup *dead_memcg @@
   mm/memcontrol.c:1101:17: sparse:     expected struct mem_cgroup [noderef] __rcu *__old
   mm/memcontrol.c:1101:17: sparse:     got struct mem_cgroup *dead_memcg
   mm/memcontrol.c: note: in included file (through include/linux/cgroup-defs.h):
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
   mm/memcontrol.c: note: in included file:
   include/linux/memcontrol.h:747:9: sparse: sparse: context imbalance in 'folio_lruvec_lock' - wrong count at exit
   include/linux/memcontrol.h:747:9: sparse: sparse: context imbalance in 'folio_lruvec_lock_irq' - wrong count at exit
   include/linux/memcontrol.h:747:9: sparse: sparse: context imbalance in 'folio_lruvec_lock_irqsave' - wrong count at exit
   mm/memcontrol.c: note: in included file (through include/linux/cgroup-defs.h):
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true

vim +1063 mm/memcontrol.c

4b569387c0d566 Nhat Pham         2023-10-06   974  
5660048ccac873 Johannes Weiner   2012-01-12   975  /**
5660048ccac873 Johannes Weiner   2012-01-12   976   * mem_cgroup_iter - iterate over memory cgroup hierarchy
5660048ccac873 Johannes Weiner   2012-01-12   977   * @root: hierarchy root
5660048ccac873 Johannes Weiner   2012-01-12   978   * @prev: previously returned memcg, NULL on first invocation
5660048ccac873 Johannes Weiner   2012-01-12   979   * @reclaim: cookie for shared reclaim walks, NULL for full walks
5660048ccac873 Johannes Weiner   2012-01-12   980   *
5660048ccac873 Johannes Weiner   2012-01-12   981   * Returns references to children of the hierarchy below @root, or
5660048ccac873 Johannes Weiner   2012-01-12   982   * @root itself, or %NULL after a full round-trip.
5660048ccac873 Johannes Weiner   2012-01-12   983   *
5660048ccac873 Johannes Weiner   2012-01-12   984   * Caller must pass the return value in @prev on subsequent
5660048ccac873 Johannes Weiner   2012-01-12   985   * invocations for reference counting, or use mem_cgroup_iter_break()
5660048ccac873 Johannes Weiner   2012-01-12   986   * to cancel a hierarchy walk before the round-trip is complete.
5660048ccac873 Johannes Weiner   2012-01-12   987   *
05bdc520b3ad39 Miaohe Lin        2020-10-13   988   * Reclaimers can specify a node in @reclaim to divide up the memcgs
05bdc520b3ad39 Miaohe Lin        2020-10-13   989   * in the hierarchy among all concurrent reclaimers operating on the
05bdc520b3ad39 Miaohe Lin        2020-10-13   990   * same node.
5660048ccac873 Johannes Weiner   2012-01-12   991   */
694fbc0fe78518 Andrew Morton     2013-09-24   992  struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
9f3a0d0933de07 Johannes Weiner   2012-01-12   993  				   struct mem_cgroup *prev,
694fbc0fe78518 Andrew Morton     2013-09-24   994  				   struct mem_cgroup_reclaim_cookie *reclaim)
7d74b06f240f1b KAMEZAWA Hiroyuki 2010-10-27   995  {
3f649ab728cda8 Kees Cook         2020-06-03   996  	struct mem_cgroup_reclaim_iter *iter;
5ac8fb31ad2ebd Johannes Weiner   2014-12-10   997  	struct cgroup_subsys_state *css = NULL;
9f3a0d0933de07 Johannes Weiner   2012-01-12   998  	struct mem_cgroup *memcg = NULL;
5ac8fb31ad2ebd Johannes Weiner   2014-12-10   999  	struct mem_cgroup *pos = NULL;
711d3d2c9bc3fb KAMEZAWA Hiroyuki 2010-10-27  1000  
694fbc0fe78518 Andrew Morton     2013-09-24  1001  	if (mem_cgroup_disabled())
694fbc0fe78518 Andrew Morton     2013-09-24  1002  		return NULL;
5660048ccac873 Johannes Weiner   2012-01-12  1003  
9f3a0d0933de07 Johannes Weiner   2012-01-12  1004  	if (!root)
9f3a0d0933de07 Johannes Weiner   2012-01-12  1005  		root = root_mem_cgroup;
9f3a0d0933de07 Johannes Weiner   2012-01-12  1006  
542f85f9ae4acd Michal Hocko      2013-04-29  1007  	rcu_read_lock();
14067bb3e24b96 KAMEZAWA Hiroyuki 2009-04-02  1008  
527a5ec9a53471 Johannes Weiner   2012-01-12  1009  	if (reclaim) {
ef8f2327996b5c Mel Gorman        2016-07-28  1010  		struct mem_cgroup_per_node *mz;
527a5ec9a53471 Johannes Weiner   2012-01-12  1011  
a3747b53b1771a Johannes Weiner   2021-04-29  1012  		mz = root->nodeinfo[reclaim->pgdat->node_id];
9da83f3fc74b80 Yafang Shao       2019-11-30  1013  		iter = &mz->iter;
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1014  
a9320aae68a1cd Wei Yang          2022-04-28  1015  		/*
a9320aae68a1cd Wei Yang          2022-04-28  1016  		 * On start, join the current reclaim iteration cycle.
a9320aae68a1cd Wei Yang          2022-04-28  1017  		 * Exit when a concurrent walker completes it.
a9320aae68a1cd Wei Yang          2022-04-28  1018  		 */
a9320aae68a1cd Wei Yang          2022-04-28  1019  		if (!prev)
a9320aae68a1cd Wei Yang          2022-04-28  1020  			reclaim->generation = iter->generation;
a9320aae68a1cd Wei Yang          2022-04-28  1021  		else if (reclaim->generation != iter->generation)
542f85f9ae4acd Michal Hocko      2013-04-29  1022  			goto out_unlock;
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1023  
fe6faac115aa89 Kinsey Ho         2024-07-24  1024  		pos = rcu_dereference(iter->position);
89d8330ccf2ad4 Wei Yang          2022-04-28  1025  	} else if (prev) {
89d8330ccf2ad4 Wei Yang          2022-04-28  1026  		pos = prev;
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1027  	}
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1028  
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1029  	if (pos)
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1030  		css = &pos->css;
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1031  
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1032  	for (;;) {
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1033  		css = css_next_descendant_pre(css, &root->css);
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1034  		if (!css) {
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1035  			/*
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1036  			 * Reclaimers share the hierarchy walk, and a
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1037  			 * new one might jump in right at the end of
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1038  			 * the hierarchy - make sure they see at least
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1039  			 * one group and restart from the beginning.
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1040  			 */
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1041  			if (!prev)
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1042  				continue;
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1043  			break;
542f85f9ae4acd Michal Hocko      2013-04-29  1044  		}
5f578161971863 Michal Hocko      2013-04-29  1045  
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1046  		/*
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1047  		 * Verify the css and acquire a reference.  The root
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1048  		 * is provided by the caller, so we know it's alive
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1049  		 * and kicking, and don't take an extra reference.
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1050  		 */
41555dadbff8d2 Wei Yang          2022-04-28  1051  		if (css == &root->css || css_tryget(css)) {
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1052  			memcg = mem_cgroup_from_css(css);
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1053  			break;
41555dadbff8d2 Wei Yang          2022-04-28  1054  		}
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1055  	}
9f3a0d0933de07 Johannes Weiner   2012-01-12  1056  
527a5ec9a53471 Johannes Weiner   2012-01-12  1057  	if (reclaim) {
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1058  		/*
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1059  		 * The position could have already been updated by a competing
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1060  		 * thread, so check that the value hasn't changed since we read
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1061  		 * it to avoid reclaiming from the same cgroup twice.
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1062  		 */
6df38689e0e9a0 Vladimir Davydov  2015-12-29 @1063  		(void)cmpxchg(&iter->position, pos, memcg);
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1064  
19f39402864ea3 Michal Hocko      2013-04-29  1065  		if (!memcg)
527a5ec9a53471 Johannes Weiner   2012-01-12  1066  			iter->generation++;
527a5ec9a53471 Johannes Weiner   2012-01-12  1067  	}
14067bb3e24b96 KAMEZAWA Hiroyuki 2009-04-02  1068  
542f85f9ae4acd Michal Hocko      2013-04-29  1069  out_unlock:
542f85f9ae4acd Michal Hocko      2013-04-29  1070  	rcu_read_unlock();
c40046f3ad5e87 Michal Hocko      2013-04-29  1071  	if (prev && prev != root)
c40046f3ad5e87 Michal Hocko      2013-04-29  1072  		css_put(&prev->css);
c40046f3ad5e87 Michal Hocko      2013-04-29  1073  
9f3a0d0933de07 Johannes Weiner   2012-01-12  1074  	return memcg;
9f3a0d0933de07 Johannes Weiner   2012-01-12  1075  }
14067bb3e24b96 KAMEZAWA Hiroyuki 2009-04-02  1076  
5660048ccac873 Johannes Weiner   2012-01-12  1077  /**
5660048ccac873 Johannes Weiner   2012-01-12  1078   * mem_cgroup_iter_break - abort a hierarchy walk prematurely
5660048ccac873 Johannes Weiner   2012-01-12  1079   * @root: hierarchy root
5660048ccac873 Johannes Weiner   2012-01-12  1080   * @prev: last visited hierarchy member as returned by mem_cgroup_iter()
5660048ccac873 Johannes Weiner   2012-01-12  1081   */
5660048ccac873 Johannes Weiner   2012-01-12  1082  void mem_cgroup_iter_break(struct mem_cgroup *root,
9f3a0d0933de07 Johannes Weiner   2012-01-12  1083  			   struct mem_cgroup *prev)
9f3a0d0933de07 Johannes Weiner   2012-01-12  1084  {
711d3d2c9bc3fb KAMEZAWA Hiroyuki 2010-10-27  1085  	if (!root)
711d3d2c9bc3fb KAMEZAWA Hiroyuki 2010-10-27  1086  		root = root_mem_cgroup;
9f3a0d0933de07 Johannes Weiner   2012-01-12  1087  	if (prev && prev != root)
9f3a0d0933de07 Johannes Weiner   2012-01-12  1088  		css_put(&prev->css);
14067bb3e24b96 KAMEZAWA Hiroyuki 2009-04-02  1089  }
9f3a0d0933de07 Johannes Weiner   2012-01-12  1090  
54a83d6bcbf8f4 Miles Chen        2019-08-13  1091  static void __invalidate_reclaim_iterators(struct mem_cgroup *from,
54a83d6bcbf8f4 Miles Chen        2019-08-13  1092  					struct mem_cgroup *dead_memcg)
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1093  {
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1094  	struct mem_cgroup_reclaim_iter *iter;
ef8f2327996b5c Mel Gorman        2016-07-28  1095  	struct mem_cgroup_per_node *mz;
ef8f2327996b5c Mel Gorman        2016-07-28  1096  	int nid;
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1097  
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1098  	for_each_node(nid) {
a3747b53b1771a Johannes Weiner   2021-04-29  1099  		mz = from->nodeinfo[nid];
9da83f3fc74b80 Yafang Shao       2019-11-30  1100  		iter = &mz->iter;
9da83f3fc74b80 Yafang Shao       2019-11-30 @1101  		cmpxchg(&iter->position, dead_memcg, NULL);
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1102  	}
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1103  }
54a83d6bcbf8f4 Miles Chen        2019-08-13  1104
Johannes Weiner July 25, 2024, 8:43 p.m. UTC | #2
On Wed, Jul 24, 2024 at 07:02:11PM +0000, Kinsey Ho wrote:
> To obtain the pointer to the saved memcg position, mem_cgroup_iter()
> currently holds css->refcnt during memcg traversal only to put
> css->refcnt at the end of the routine. This isn't necessary as an
> rcu_read_lock is already held throughout the function.
> 
> Remove css->refcnt usage during traversal by leveraging RCU.

Eh, I don't know about this.

RCU ensures that the css memory isn't freed.

The tryget ensures that the css is still alive and valid.

In this case, it just so happens that the sibling linkage is also rcu
protected. But accessing random css members when the refcount is 0 is
kind of sketchy. On the other hand, the refcount is guaranteed to be
valid, and rcu + tryget is a common pattern.

What does this buy us? The tryget is cheap.
Roman Gushchin July 25, 2024, 9:09 p.m. UTC | #3
On Thu, Jul 25, 2024 at 04:43:46PM -0400, Johannes Weiner wrote:
> On Wed, Jul 24, 2024 at 07:02:11PM +0000, Kinsey Ho wrote:
> > To obtain the pointer to the saved memcg position, mem_cgroup_iter()
> > currently holds css->refcnt during memcg traversal only to put
> > css->refcnt at the end of the routine. This isn't necessary as an
> > rcu_read_lock is already held throughout the function.
> > 
> > Remove css->refcnt usage during traversal by leveraging RCU.
> 
> Eh, I don't know about this.
> 
> RCU ensures that the css memory isn't freed.
> 
> The tryget ensures that the css is still alive and valid.
> 
> In this case, it just so happens that the sibling linkage is also rcu
> protected. But accessing random css members when the refcount is 0 is
> kind of sketchy. On the other hand, the refcount is guaranteed to be
> valid, and rcu + tryget is a common pattern.

I also spent quite some time thinking about potential bad consequences,
but _it seems_ to be safe (but I agree it feels dangerous).

> 
> What does this buy us? The tryget is cheap.

To be fair, tryget is not always cheap. Offline/dying cgroups have an atomic
operation there.
Yosry Ahmed July 25, 2024, 10:33 p.m. UTC | #4
On Thu, Jul 25, 2024 at 1:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Jul 24, 2024 at 07:02:11PM +0000, Kinsey Ho wrote:
> > To obtain the pointer to the saved memcg position, mem_cgroup_iter()
> > currently holds css->refcnt during memcg traversal only to put
> > css->refcnt at the end of the routine. This isn't necessary as an
> > rcu_read_lock is already held throughout the function.
> >
> > Remove css->refcnt usage during traversal by leveraging RCU.
>
> Eh, I don't know about this.
>
> RCU ensures that the css memory isn't freed.
>
> The tryget ensures that the css is still alive and valid.
>
> In this case, it just so happens that the sibling linkage is also rcu
> protected. But accessing random css members when the refcount is 0 is
> kind of sketchy. On the other hand, the refcount is guaranteed to be
> valid, and rcu + tryget is a common pattern.

To be fair, the documentation of css_next_descendant_pre() mentions
that the requirements are:
- Either cgroup_mutex or RCU lock is held.
- Both @pos and @root are accessible.
- @pos is a descendant of @root.

This reads to me like it is intentional that RCU protection is enough
for @pos and @root, and that the sibling linkage is RCU protected by
design. Perhaps we could clarify this further (whether at
css_next_descendant_pre(), or above the definition of the linkage
members).

>
> What does this buy us? The tryget is cheap.

mem_cgroup_iter() is not an easy function to follow, so I personally
appreciate the simplicity gains tbh.
Kinsey Ho Aug. 1, 2024, 9:46 p.m. UTC | #5
Thank you Johannes, Roman, and Yosry for reviewing this patch!

On Thu, Jul 25, 2024 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> On Thu, Jul 25, 2024 at 1:43 PM Johannes Weiner <hannes@cmpxchg.org>
wrote:
> > What does this buy us? The tryget is cheap.
>
> mem_cgroup_iter() is not an easy function to follow, so I personally
> appreciate the simplicity gains tbh.

Yes, the main intention here was to simplify the code’s readability.

> This reads to me like it is intentional that RCU protection is enough
> for @pos and @root, and that the sibling linkage is RCU protected by
> design. Perhaps we could clarify this further (whether at
> css_next_descendant_pre(), or above the definition of the linkage
> members).

Do we want to move forward with Yosry’s suggestion to clarify that the
sibling linkage is RCU-protected by design? Perhaps this clarification can
be made in the definition of the linkage members so that the safety of the
css in this function is more clear to users. If this is sufficient, I will
make the change in a v2 patchset.
Kinsey Ho Aug. 1, 2024, 10:32 p.m. UTC | #6
Sorry, I replied to this email earlier but it had some issues with plain
text. Please ignore the first reply of mine (the one with HTML). I'm resending
the email below.

Thank you Johannes, Roman, and Yosry for reviewing this patch!

On Thu, Jul 25, 2024 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> On Thu, Jul 25, 2024 at 1:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > What does this buy us? The tryget is cheap.
>
> mem_cgroup_iter() is not an easy function to follow, so I personally
> appreciate the simplicity gains tbh.

Yes, the main intention here was to simplify the code's readability.

> This reads to me like it is intentional that RCU protection is enough
> for @pos and @root, and that the sibling linkage is RCU protected by
> design. Perhaps we could clarify this further (whether at
> css_next_descendant_pre(), or above the definition of the linkage
> members).

Do we want to move forward with Yosry's suggestion to clarify that the
sibling linkage is RCU-protected by design? Perhaps this clarification
can be made in the definition of the linkage members so that the
safety of the css in this function is more clear to users. If this is
sufficient, I will make the change in a v2 patchset.
Johannes Weiner Aug. 2, 2024, 1:28 a.m. UTC | #7
On Thu, Aug 01, 2024 at 03:32:53PM -0700, Kinsey Ho wrote:
> Sorry, I replied to this email earlier but it had some issues with plain
> text. Please ignore the first reply of mine (the one with HTML). I'm resending
> the email below.
> 
> Thank you Johannes, Roman, and Yosry for reviewing this patch!
> 
> On Thu, Jul 25, 2024 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > On Thu, Jul 25, 2024 at 1:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > What does this buy us? The tryget is cheap.
> >
> > mem_cgroup_iter() is not an easy function to follow, so I personally
> > appreciate the simplicity gains tbh.
> 
> Yes, the main intention here was to simplify the code's readability.
>
> > This reads to me like it is intentional that RCU protection is enough
> > for @pos and @root, and that the sibling linkage is RCU protected by
> > design. Perhaps we could clarify this further (whether at
> > css_next_descendant_pre(), or above the definition of the linkage
> > members).
> 
> Do we want to move forward with Yosry's suggestion to clarify that the
> sibling linkage is RCU-protected by design? Perhaps this clarification
> can be made in the definition of the linkage members so that the
> safety of the css in this function is more clear to users. If this is
> sufficient, I will make the change in a v2 patchset.

Yes, that sounds like a good way forward to me.
diff mbox series

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7e2eb091049a..4cbab85e2e56 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -75,7 +75,7 @@  struct lruvec_stats_percpu;
 struct lruvec_stats;
 
 struct mem_cgroup_reclaim_iter {
-	struct mem_cgroup *position;
+	struct mem_cgroup __rcu *position;
 	/* scan generation, increased every round-trip */
 	unsigned int generation;
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 960371788687..062bfeee799c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1019,20 +1019,7 @@  struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		else if (reclaim->generation != iter->generation)
 			goto out_unlock;
 
-		while (1) {
-			pos = READ_ONCE(iter->position);
-			if (!pos || css_tryget(&pos->css))
-				break;
-			/*
-			 * css reference reached zero, so iter->position will
-			 * be cleared by ->css_released. However, we should not
-			 * rely on this happening soon, because ->css_released
-			 * is called from a work queue, and by busy-waiting we
-			 * might block it. So we clear iter->position right
-			 * away.
-			 */
-			(void)cmpxchg(&iter->position, pos, NULL);
-		}
+		pos = rcu_dereference(iter->position);
 	} else if (prev) {
 		pos = prev;
 	}
@@ -1073,9 +1060,6 @@  struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		 */
 		(void)cmpxchg(&iter->position, pos, memcg);
 
-		if (pos)
-			css_put(&pos->css);
-
 		if (!memcg)
 			iter->generation++;
 	}