diff mbox series

[v2,3/9] mm: vmscan: guarantee shrink_slab_memcg() sees valid shrinker_maps for online memcg

Message ID 20201214223722.232537-4-shy828301@gmail.com (mailing list archive)
State New, archived
Headers show
Series Make shrinker's nr_deferred memcg aware | expand

Commit Message

Yang Shi Dec. 14, 2020, 10:37 p.m. UTC
shrink_slab_memcg() races with mem_cgroup_css_online(): seeing the CSS_ONLINE
flag in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that we will
also see memcg->nodeinfo[nid]->shrinker_maps != NULL. This can happen because of
processor reordering on !x86 architectures.

The race looks like the following:

           CPU A          CPU B
store shrinker_map      load CSS_ONLINE
store CSS_ONLINE        load shrinker_map

The required ordering can be guaranteed by an smp_wmb()/smp_rmb() pair.

The same barrier pair will also guarantee the ordering between shrinker_deferred
and CSS_ONLINE for the following patches.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/memcontrol.c | 7 +++++++
 mm/vmscan.c     | 8 +++++---
 2 files changed, 12 insertions(+), 3 deletions(-)
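The store/load pattern in the commit message can be sketched in userspace C11 terms. This is an illustrative analogue only, assuming C11 fences in place of the kernel's smp_wmb()/smp_rmb(); the names (memcg_sketch, publish, consume_map) are hypothetical stand-ins, not kernel APIs.

```c
/*
 * Userspace sketch of the smp_wmb()/smp_rmb() pairing described in the
 * commit message, using C11 fences. Names are illustrative stand-ins.
 */
#include <stdatomic.h>
#include <stddef.h>

struct memcg_sketch {
	int *shrinker_map;	/* stands in for nodeinfo[nid]->shrinker_map */
	atomic_int css_online;	/* stands in for the CSS_ONLINE flag */
};

/* CPU A: store shrinker_map, write barrier, then store CSS_ONLINE. */
static void publish(struct memcg_sketch *m, int *map)
{
	m->shrinker_map = map;
	atomic_thread_fence(memory_order_release);	/* smp_wmb() analogue */
	atomic_store_explicit(&m->css_online, 1, memory_order_relaxed);
}

/* CPU B: load CSS_ONLINE, read barrier, then load shrinker_map. */
static int *consume_map(struct memcg_sketch *m)
{
	if (!atomic_load_explicit(&m->css_online, memory_order_relaxed))
		return NULL;
	atomic_thread_fence(memory_order_acquire);	/* smp_rmb() analogue */
	return m->shrinker_map;	/* non-NULL once css_online was observed */
}
```

Without the fences, a weakly ordered CPU may observe css_online == 1 while still reading a stale NULL shrinker_map, which is exactly the race the patch targets.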

Comments

Dave Chinner Dec. 15, 2020, 2:04 a.m. UTC | #1
On Mon, Dec 14, 2020 at 02:37:16PM -0800, Yang Shi wrote:
> The shrink_slab_memcg() races with mem_cgroup_css_online(). A visibility of CSS_ONLINE flag
> in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that we will see
> memcg->nodeinfo[nid]->shrinker_maps != NULL.  This may occur because of processor reordering
> on !x86.
> 
> This seems like the below case:
> 
>            CPU A          CPU B
> store shrinker_map      load CSS_ONLINE
> store CSS_ONLINE        load shrinker_map
> 
> So the memory ordering could be guaranteed by smp_wmb()/smp_rmb() pair.
> 
> The memory barriers pair will guarantee the ordering between shrinker_deferred and CSS_ONLINE
> for the following patches as well.

This should not require memory barriers in the shrinker code.

The code that sets and checks the CSS_ONLINE flag should have the
memory barriers to ensure that anything that sees an online CSS will
see it completely set up.

That is, online_css(), which sets CSS_ONLINE, needs a memory barrier
to ensure all previous writes are completed before the CSS_ONLINE
flag is set, and mem_cgroup_online() needs a barrier to pair with it.

This is the same existence issue that the superblock shrinkers have
with the shrinkers being registered before the superblock is fully
set up. The SB_BORN flag on the superblock indicates the superblock
is now fully set up ("online" in CSS speak) and the registered
shrinker can run. Please see the smp_wmb() before we set SB_BORN in
vfs_get_tree(), and the big comment about the smp_rmb() -after- we
check SB_BORN in super_cache_count() to understand the details of
the data dependency between the flag and the structures being set up
that the barriers enforce.
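Dave's suggestion, putting the ordering on the flag itself inside the cgroup core, can be sketched with C11 release/acquire operations. CSS_ONLINE_SKETCH and the helpers below are hypothetical names for illustration, not the actual cgroup or VFS code:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define CSS_ONLINE_SKETCH (1u << 0)	/* hypothetical flag bit */

/* online_css() analogue: publish the flag with release semantics so every
 * write that set up the css (e.g. shrinker_maps) is visible before it. */
static void set_online(atomic_uint *flags)
{
	atomic_fetch_or_explicit(flags, CSS_ONLINE_SKETCH, memory_order_release);
}

/* mem_cgroup_online() analogue: the acquire load pairs with the release
 * above, so a reader that sees the flag also sees the completed setup. */
static bool is_online(atomic_uint *flags)
{
	return atomic_load_explicit(flags, memory_order_acquire) & CSS_ONLINE_SKETCH;
}
```

With this shape, callers such as the shrinker code need no barriers of their own, which is the point Dave is making.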

IOWs, these memory barriers belong inside the cgroup code to
guarantee anything that sees an online cgroup will always see the
fully initialised cgroup structures. They do not belong in the
shrinker infrastructure...

Cheers,

Dave.
Kirill Tkhai Dec. 15, 2020, 12:58 p.m. UTC | #2
15.12.2020, 15:40, "Johannes Weiner" <hannes@cmpxchg.org>:
> On Mon, Dec 14, 2020 at 02:37:16PM -0800, Yang Shi wrote:
>>  The shrink_slab_memcg() races with mem_cgroup_css_online(). A visibility of CSS_ONLINE flag
>>  in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that we will see
>>  memcg->nodeinfo[nid]->shrinker_maps != NULL. This may occur because of processor reordering
>>  on !x86.
>>
>>  This seems like the below case:
>>
>>             CPU A          CPU B
>>  store shrinker_map      load CSS_ONLINE
>>  store CSS_ONLINE        load shrinker_map
>>
>>  So the memory ordering could be guaranteed by smp_wmb()/smp_rmb() pair.
>>
>>  The memory barriers pair will guarantee the ordering between shrinker_deferred and CSS_ONLINE
>>  for the following patches as well.
>>
>>  Signed-off-by: Yang Shi <shy828301@gmail.com>
>
> As per previous feedback, please move the misplaced shrinker
> allocation callback from .css_online to .css_alloc. This will get you
> the necessary ordering guarantees from the cgroup core code.


Can you read my emails from ktkhai@virtuozzo.com? I've already answered
this question here: https://lkml.org/lkml/2020/12/10/726

Check your spam folder, and add my address to allow-list if so.
Johannes Weiner Dec. 15, 2020, 5:14 p.m. UTC | #3
On Mon, Dec 14, 2020 at 02:37:16PM -0800, Yang Shi wrote:
> The shrink_slab_memcg() races with mem_cgroup_css_online(). A visibility of CSS_ONLINE flag
> in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that we will see
> memcg->nodeinfo[nid]->shrinker_maps != NULL.  This may occur because of processor reordering
> on !x86.
> 
> This seems like the below case:
> 
>            CPU A          CPU B
> store shrinker_map      load CSS_ONLINE
> store CSS_ONLINE        load shrinker_map

But we have a separate check on shrinker_maps, so it doesn't matter
that it isn't guaranteed, no?

The only downside I can see is when CSS_ONLINE isn't visible yet and
we bail even though we'd be ready to shrink. Although it's probably
unlikely that there would be any objects allocated already...

Can somebody remind me why we check mem_cgroup_online() at all?

If shrinker_map is set, we can shrink: .css_alloc is guaranteed to be
complete, and by using RCU for the shrinker_map pointer, the map is
also guaranteed to be initialized. There is nothing else happening
during onlining that you may depend on.

If shrinker_map isn't set, we cannot iterate the bitmap. It does not
really matter whether CSS_ONLINE is reordered and visible already.

Agreed with Dave: if we need that synchronization around onlining, it
needs to happen inside the cgroup core. But I wouldn't add that until
somebody actually required it.
Yang Shi Dec. 28, 2020, 8:03 p.m. UTC | #4
Johannes's point makes sense to me. If shrinker_maps is not initialized
yet, the memcg is too young to have accumulated any meaningful number of
reclaimable slab caches, so it is fine to just skip it.

And, with shrinker_maps and shrinker_deferred consolidated into one
struct, we could just check the struct pointer. So this patch is no
longer necessary; it will be dropped in v3.
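The pointer-check approach planned for v3 can be sketched as follows. shrinker_info_sketch is a hypothetical stand-in for the consolidated shrinker_maps/shrinker_deferred struct, not the actual v3 code:

```c
#include <stddef.h>

/* Hypothetical consolidated struct holding both the bitmap and the
 * per-shrinker deferred counts. */
struct shrinker_info_sketch {
	unsigned long *map;
	long *nr_deferred;
};

/* Bail out when the struct hasn't been published yet: a memcg that young
 * has no reclaimable slab caches worth shrinking, so skipping is safe. */
static unsigned long shrink_slab_sketch(struct shrinker_info_sketch *info)
{
	if (!info || !info->map)
		return 0;
	/* ... iterate the bitmap and invoke the shrinkers ... */
	return 0;
}
```

This trades a possible missed shrink on a freshly onlined memcg for not needing any barrier in the shrinker path, matching the consensus in the thread.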

On Tue, Dec 15, 2020 at 12:31 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Tue, Dec 15, 2020 at 9:16 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Mon, Dec 14, 2020 at 02:37:16PM -0800, Yang Shi wrote:
> > > The shrink_slab_memcg() races with mem_cgroup_css_online(). A visibility of CSS_ONLINE flag
> > > in shrink_slab_memcg()->mem_cgroup_online() does not guarantee that we will see
> > > memcg->nodeinfo[nid]->shrinker_maps != NULL.  This may occur because of processor reordering
> > > on !x86.
> > >
> > > This seems like the below case:
> > >
> > >            CPU A          CPU B
> > > store shrinker_map      load CSS_ONLINE
> > > store CSS_ONLINE        load shrinker_map
> >
> > But we have a separate check on shrinker_maps, so it doesn't matter
> > that it isn't guaranteed, no?
>
> IIUC, yes. Checking shrinker_maps is the alternative way to handle the
> reordering and avoid dereferencing a NULL shrinker_maps, per Kirill.
>
> We could check shrinker_deferred too, then just walk away if it is NULL.
>
> >
> > The only downside I can see is when CSS_ONLINE isn't visible yet and
> > we bail even though we'd be ready to shrink. Although it's probably
> > unlikely that there would be any objects allocated already...
>
> Yes, it seems so.
>
> >
> > Can somebody remind me why we check mem_cgroup_online() at all?
>
> IIUC it should be mainly used to skip offlined memcgs since there is
> nothing on offlined memcgs' LRU because all objects have been
> reparented. But shrinker_map won't be freed until .css_free is called.
> So the shrinkers might be called in vain.
>
> >
> > If shrinker_map is set, we can shrink: .css_alloc is guaranteed to be
> > complete, and by using RCU for the shrinker_map pointer, the map is
> > also guaranteed to be initialized. There is nothing else happening
> > during onlining that you may depend on.
> >
> > If shrinker_map isn't set, we cannot iterate the bitmap. It does not
> > really matter whether CSS_ONLINE is reordered and visible already.
>
> As I mentioned above, it should be used to skip offlined memcgs, but it
> also opens up the race due to memory reordering. As Kirill explained in
> the earlier email, we could either check the pointer or use memory
> barriers.
>
> If the memory barriers seem like overkill, I can definitely switch back
> to the NULL pointer check approach.
>
> >
> > Agreed with Dave: if we need that synchronization around onlining, it
> > needs to happen inside the cgroup core. But I wouldn't add that until
> > somebody actually required it.

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ed942734235f..3d4ddbb84a01 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5406,6 +5406,13 @@  static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		return -ENOMEM;
 	}
 
+	/*
+	 * Barrier for CSS_ONLINE, so that shrink_slab_memcg() sees shrinker_maps
+	 * and shrinker_deferred before CSS_ONLINE. It pairs with the read barrier
+	 * in shrink_slab_memcg().
+	 */
+	smp_wmb();
+
 	/* Online state pins memcg ID, memcg ID pins CSS */
 	refcount_set(&memcg->id.ref, 1);
 	css_get(css);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 912c044301dd..9b31b9c419ec 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -552,13 +552,15 @@  static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 	if (!mem_cgroup_online(memcg))
 		return 0;
 
+	/* Pairs with write barrier in mem_cgroup_css_online() */
+	smp_rmb();
+
 	if (!down_read_trylock(&shrinker_rwsem))
 		return 0;
 
+	/* Once the memcg is online, shrinker_map can't be NULL */
 	map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map,
 					true);
-	if (unlikely(!map))
-		goto unlock;
 
 	for_each_set_bit(i, map->map, shrinker_nr_max) {
 		struct shrink_control sc = {
@@ -612,7 +614,7 @@  static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 			break;
 		}
 	}
-unlock:
+
 	up_read(&shrinker_rwsem);
 	return freed;
 }