diff mbox

[v8,05/17] mm: Assign memcg-aware shrinkers bitmap to memcg

Message ID 153063056619.1818.12550500883688681076.stgit@localhost.localdomain
State New, archived
Headers show

Commit Message

Kirill Tkhai July 3, 2018, 3:09 p.m. UTC
Imagine a big node with many cpus, memory cgroups and containers.
Let we have 200 containers, every container has 10 mounts,
and 10 cgroups. All container tasks don't touch foreign
containers mounts. If there is intensive pages write,
and global reclaim happens, a writing task has to iterate
over all memcgs to shrink slab, before it's able to go
to shrink_page_list().

Iteration over all the memcg slabs is very expensive:
the task has to visit 200 * 10 = 2000 shrinkers
for every memcg, and since there are 2000 memcgs,
the total calls are 2000 * 2000 = 4000000.

So, the shrinker makes 4 million do_shrink_slab() calls
just to try to isolate SWAP_CLUSTER_MAX pages in one
of the actively writing memcg via shrink_page_list().
I've observed a node spending almost 100% in kernel,
making useless iteration over already shrinked slab.

This patch adds bitmap of memcg-aware shrinkers to memcg.
The size of the bitmap depends on bitmap_nr_ids, and during
memcg life it's maintained to be enough to fit bitmap_nr_ids
shrinkers. Every bit in the map is related to corresponding
shrinker id.

Next patches will maintain set bit only for really charged
memcg. This will allow shrink_slab() to increase its
performance in significant way. See the last patch for
the numbers.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
---
 include/linux/memcontrol.h |   14 +++++
 mm/memcontrol.c            |  120 ++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                |    5 ++
 3 files changed, 139 insertions(+)

Comments

Andrew Morton July 3, 2018, 8:50 p.m. UTC | #1
On Tue, 03 Jul 2018 18:09:26 +0300 Kirill Tkhai <ktkhai@virtuozzo.com> wrote:

> Imagine a big node with many cpus, memory cgroups and containers.
> Let we have 200 containers, every container has 10 mounts,
> and 10 cgroups. All container tasks don't touch foreign
> containers mounts. If there is intensive pages write,
> and global reclaim happens, a writing task has to iterate
> over all memcgs to shrink slab, before it's able to go
> to shrink_page_list().
> 
> Iteration over all the memcg slabs is very expensive:
> the task has to visit 200 * 10 = 2000 shrinkers
> for every memcg, and since there are 2000 memcgs,
> the total calls are 2000 * 2000 = 4000000.
> 
> So, the shrinker makes 4 million do_shrink_slab() calls
> just to try to isolate SWAP_CLUSTER_MAX pages in one
> of the actively writing memcg via shrink_page_list().
> I've observed a node spending almost 100% in kernel,
> making useless iteration over already shrinked slab.
> 
> This patch adds bitmap of memcg-aware shrinkers to memcg.
> The size of the bitmap depends on bitmap_nr_ids, and during
> memcg life it's maintained to be enough to fit bitmap_nr_ids
> shrinkers. Every bit in the map is related to corresponding
> shrinker id.
> 
> Next patches will maintain set bit only for really charged
> memcg. This will allow shrink_slab() to increase its
> performance in significant way. See the last patch for
> the numbers.
> 
> ...
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -182,6 +182,11 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>  	if (id < 0)
>  		goto unlock;
>  
> +	if (memcg_expand_shrinker_maps(id)) {
> +		idr_remove(&shrinker_idr, id);
> +		goto unlock;
> +	}
> +
>  	if (id >= shrinker_nr_max)
>  		shrinker_nr_max = id + 1;
>  	shrinker->id = id;

This function ends up being a rather sad little thing.

: static int prealloc_memcg_shrinker(struct shrinker *shrinker)
: {
: 	int id, ret = -ENOMEM;
: 
: 	down_write(&shrinker_rwsem);
: 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
: 	if (id < 0)
: 		goto unlock;
: 
: 	if (memcg_expand_shrinker_maps(id)) {
: 		idr_remove(&shrinker_idr, id);
: 		goto unlock;
: 	}
: 
: 	if (id >= shrinker_nr_max)
: 		shrinker_nr_max = id + 1;
: 	shrinker->id = id;
: 	ret = 0;
: unlock:
: 	up_write(&shrinker_rwsem);
: 	return ret;
: }

- there's no need to call memcg_expand_shrinker_maps() unless id >=
  shrinker_nr_max so why not move the code and avoid calling
  memcg_expand_shrinker_maps() in most cases.

- why aren't we decreasing shrinker_nr_max in
  unregister_memcg_shrinker()?  That's easy to do, avoids pointless
  work in shrink_slab_memcg() and avoids memory waste in future
  prealloc_memcg_shrinker() calls.

  It should be possible to find the highest ID in an IDR tree with a
  straightforward descent of the underlying radix tree, but I doubt if
  that has been wired up.  Otherwise a simple loop in
  unregister_memcg_shrinker() would be needed.
Kirill Tkhai July 4, 2018, 3:51 p.m. UTC | #2
On 03.07.2018 23:50, Andrew Morton wrote:
> On Tue, 03 Jul 2018 18:09:26 +0300 Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> 
>> Imagine a big node with many cpus, memory cgroups and containers.
>> Let we have 200 containers, every container has 10 mounts,
>> and 10 cgroups. All container tasks don't touch foreign
>> containers mounts. If there is intensive pages write,
>> and global reclaim happens, a writing task has to iterate
>> over all memcgs to shrink slab, before it's able to go
>> to shrink_page_list().
>>
>> Iteration over all the memcg slabs is very expensive:
>> the task has to visit 200 * 10 = 2000 shrinkers
>> for every memcg, and since there are 2000 memcgs,
>> the total calls are 2000 * 2000 = 4000000.
>>
>> So, the shrinker makes 4 million do_shrink_slab() calls
>> just to try to isolate SWAP_CLUSTER_MAX pages in one
>> of the actively writing memcg via shrink_page_list().
>> I've observed a node spending almost 100% in kernel,
>> making useless iteration over already shrinked slab.
>>
>> This patch adds bitmap of memcg-aware shrinkers to memcg.
>> The size of the bitmap depends on bitmap_nr_ids, and during
>> memcg life it's maintained to be enough to fit bitmap_nr_ids
>> shrinkers. Every bit in the map is related to corresponding
>> shrinker id.
>>
>> Next patches will maintain set bit only for really charged
>> memcg. This will allow shrink_slab() to increase its
>> performance in significant way. See the last patch for
>> the numbers.
>>
>> ...
>>
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -182,6 +182,11 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>>  	if (id < 0)
>>  		goto unlock;
>>  
>> +	if (memcg_expand_shrinker_maps(id)) {
>> +		idr_remove(&shrinker_idr, id);
>> +		goto unlock;
>> +	}
>> +
>>  	if (id >= shrinker_nr_max)
>>  		shrinker_nr_max = id + 1;
>>  	shrinker->id = id;
> 
> This function ends up being a rather sad little thing.
> 
> : static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> : {
> : 	int id, ret = -ENOMEM;
> : 
> : 	down_write(&shrinker_rwsem);
> : 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> : 	if (id < 0)
> : 		goto unlock;
> : 
> : 	if (memcg_expand_shrinker_maps(id)) {
> : 		idr_remove(&shrinker_idr, id);
> : 		goto unlock;
> : 	}
> : 
> : 	if (id >= shrinker_nr_max)
> : 		shrinker_nr_max = id + 1;
> : 	shrinker->id = id;
> : 	ret = 0;
> : unlock:
> : 	up_write(&shrinker_rwsem);
> : 	return ret;
> : }
> 
> - there's no need to call memcg_expand_shrinker_maps() unless id >=
>   shrinker_nr_max so why not move the code and avoid calling
>   memcg_expand_shrinker_maps() in most cases.

OK

> - why aren't we decreasing shrinker_nr_max in
>   unregister_memcg_shrinker()?  That's easy to do, avoids pointless
>   work in shrink_slab_memcg() and avoids memory waste in future
>   prealloc_memcg_shrinker() calls.

You sure, but there are some things. Initially I went in the same way
as memcg_nr_cache_ids is made and just took the same x2 arithmetic.
It never decreases, so it looked good to make shrinker maps like it.
It's the only reason, so, it should not be a problem to rework.

The only moment is Vladimir strongly recommends modularity, i.e.
to have memcg_shrinker_map_size and shrinker_nr_max as different variables.
After the rework we won't be able to have this anymore, since memcontrol.c
will have to know actual shrinker_nr_max value and it will have to be exported.

Could this be a problem?

>   It should be possible to find the highest ID in an IDR tree with a
>   straightforward descent of the underlying radix tree, but I doubt if
>   that has been wired up.  Otherwise a simple loop in
>   unregister_memcg_shrinker() would be needed.

Kirill
Andrew Morton July 5, 2018, 10:10 p.m. UTC | #3
On Wed, 4 Jul 2018 18:51:12 +0300 Kirill Tkhai <ktkhai@virtuozzo.com> wrote:

> > - why aren't we decreasing shrinker_nr_max in
> >   unregister_memcg_shrinker()?  That's easy to do, avoids pointless
> >   work in shrink_slab_memcg() and avoids memory waste in future
> >   prealloc_memcg_shrinker() calls.
> 
> You sure, but there are some things. Initially I went in the same way
> as memcg_nr_cache_ids is made and just took the same x2 arithmetic.
> It never decreases, so it looked good to make shrinker maps like it.
> It's the only reason, so, it should not be a problem to rework.
> 
> The only moment is Vladimir strongly recommends modularity, i.e.
> to have memcg_shrinker_map_size and shrinker_nr_max as different variables.

For what reasons?

> After the rework we won't be able to have this anymore, since memcontrol.c
> will have to know actual shrinker_nr_max value and it will have to be exported.
>
> Could this be a problem?

Vladimir?
Matthew Wilcox July 5, 2018, 10:50 p.m. UTC | #4
On Tue, Jul 03, 2018 at 01:50:00PM -0700, Andrew Morton wrote:
>   It should be possible to find the highest ID in an IDR tree with a
>   straightforward descent of the underlying radix tree, but I doubt if
>   that has been wired up.  Otherwise a simple loop in
>   unregister_memcg_shrinker() would be needed.

Feature request received.  I've actually implemented it for the XArray
already, but it should be easy to do for the IDR too.
Vladimir Davydov July 6, 2018, 5:30 p.m. UTC | #5
On Tue, Jul 03, 2018 at 01:50:00PM -0700, Andrew Morton wrote:
> On Tue, 03 Jul 2018 18:09:26 +0300 Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> 
> > Imagine a big node with many cpus, memory cgroups and containers.
> > Let we have 200 containers, every container has 10 mounts,
> > and 10 cgroups. All container tasks don't touch foreign
> > containers mounts. If there is intensive pages write,
> > and global reclaim happens, a writing task has to iterate
> > over all memcgs to shrink slab, before it's able to go
> > to shrink_page_list().
> > 
> > Iteration over all the memcg slabs is very expensive:
> > the task has to visit 200 * 10 = 2000 shrinkers
> > for every memcg, and since there are 2000 memcgs,
> > the total calls are 2000 * 2000 = 4000000.
> > 
> > So, the shrinker makes 4 million do_shrink_slab() calls
> > just to try to isolate SWAP_CLUSTER_MAX pages in one
> > of the actively writing memcg via shrink_page_list().
> > I've observed a node spending almost 100% in kernel,
> > making useless iteration over already shrinked slab.
> > 
> > This patch adds bitmap of memcg-aware shrinkers to memcg.
> > The size of the bitmap depends on bitmap_nr_ids, and during
> > memcg life it's maintained to be enough to fit bitmap_nr_ids
> > shrinkers. Every bit in the map is related to corresponding
> > shrinker id.
> > 
> > Next patches will maintain set bit only for really charged
> > memcg. This will allow shrink_slab() to increase its
> > performance in significant way. See the last patch for
> > the numbers.
> > 
> > ...
> >
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -182,6 +182,11 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> >  	if (id < 0)
> >  		goto unlock;
> >  
> > +	if (memcg_expand_shrinker_maps(id)) {
> > +		idr_remove(&shrinker_idr, id);
> > +		goto unlock;
> > +	}
> > +
> >  	if (id >= shrinker_nr_max)
> >  		shrinker_nr_max = id + 1;
> >  	shrinker->id = id;
> 
> This function ends up being a rather sad little thing.
> 
> : static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> : {
> : 	int id, ret = -ENOMEM;
> : 
> : 	down_write(&shrinker_rwsem);
> : 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> : 	if (id < 0)
> : 		goto unlock;
> : 
> : 	if (memcg_expand_shrinker_maps(id)) {
> : 		idr_remove(&shrinker_idr, id);
> : 		goto unlock;
> : 	}
> : 
> : 	if (id >= shrinker_nr_max)
> : 		shrinker_nr_max = id + 1;
> : 	shrinker->id = id;
> : 	ret = 0;
> : unlock:
> : 	up_write(&shrinker_rwsem);
> : 	return ret;
> : }
> 
> - there's no need to call memcg_expand_shrinker_maps() unless id >=
>   shrinker_nr_max so why not move the code and avoid calling
>   memcg_expand_shrinker_maps() in most cases.

memcg_expand_shrinker_maps will return immediately if per memcg shrinker
maps can accommodate the new id. Since prealloc_memcg_shrinker is
definitely not a hot path, I don't see any penalty in calling this
function on each prealloc_memcg_shrinker invocation.

> 
> - why aren't we decreasing shrinker_nr_max in
>   unregister_memcg_shrinker()?  That's easy to do, avoids pointless
>   work in shrink_slab_memcg() and avoids memory waste in future
>   prealloc_memcg_shrinker() calls.

We can shrink the maps, but IMHO it isn't worth the complexity it would
introduce, because in my experience if a workload used N mount points
(containers, whatever) at some point of its lifetime, it is likely to
use the same amount in the future.

> 
>   It should be possible to find the highest ID in an IDR tree with a
>   straightforward descent of the underlying radix tree, but I doubt if
>   that has been wired up.  Otherwise a simple loop in
>   unregister_memcg_shrinker() would be needed.
> 
>
Vladimir Davydov July 6, 2018, 5:50 p.m. UTC | #6
On Thu, Jul 05, 2018 at 03:10:30PM -0700, Andrew Morton wrote:
> On Wed, 4 Jul 2018 18:51:12 +0300 Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> 
> > > - why aren't we decreasing shrinker_nr_max in
> > >   unregister_memcg_shrinker()?  That's easy to do, avoids pointless
> > >   work in shrink_slab_memcg() and avoids memory waste in future
> > >   prealloc_memcg_shrinker() calls.
> > 
> > You sure, but there are some things. Initially I went in the same way
> > as memcg_nr_cache_ids is made and just took the same x2 arithmetic.
> > It never decreases, so it looked good to make shrinker maps like it.
> > It's the only reason, so, it should not be a problem to rework.
> > 
> > The only moment is Vladimir strongly recommends modularity, i.e.
> > to have memcg_shrinker_map_size and shrinker_nr_max as different variables.
> 
> For what reasons?

Having the only global variable updated in vmscan.c and used in
memcontrol.c or vice versa didn't look good to me. So I suggested to
introduce two separate static variables: one for max shrinker id (local
to vmscan.c) and another for max allocated size of per memcg shrinker
maps (local to memcontrol.c). Having the two variables instead of one
allows us to define a clear API between memcontrol.c and shrinker
infrastructure without sharing variables, which makes the code easier
to follow IMHO.

It is also more flexible: with the two variables we can decrease
shrinker_id_max when a shrinker is destroyed so that we can speed up
shrink_slab - this is fairly easy to do - but leave per memcg shrinker
maps the same size, because shrinking them is rather complicated and
doesn't seem to be worthwhile - the workload is likely to use the same
amount of memory again in the future.

> 
> > After the rework we won't be able to have this anymore, since memcontrol.c
> > will have to know actual shrinker_nr_max value and it will have to be exported.

Not necessarily. You can pass max shrinker id instead of the new id to
memcontrol.c in function arguments. But as I said before, I really don't
think that shrinking per memcg maps would make much sense.

> >
> > Could this be a problem?
diff mbox

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 62309f180ee6..d8c38eafa251 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -113,6 +113,15 @@  struct lruvec_stat {
 	long count[NR_VM_NODE_STAT_ITEMS];
 };
 
+/*
+ * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
+ * which have elements charged to this memcg.
+ */
+struct memcg_shrinker_map {
+	struct rcu_head rcu;
+	unsigned long map[0];
+};
+
 /*
  * per-zone information in memory controller.
  */
@@ -126,6 +135,9 @@  struct mem_cgroup_per_node {
 
 	struct mem_cgroup_reclaim_iter	iter[DEF_PRIORITY + 1];
 
+#ifdef CONFIG_MEMCG_KMEM
+	struct memcg_shrinker_map __rcu	*shrinker_map;
+#endif
 	struct rb_node		tree_node;	/* RB tree node */
 	unsigned long		usage_in_excess;/* Set to the value by which */
 						/* the soft limit is exceeded*/
@@ -1284,6 +1296,8 @@  static inline int memcg_cache_id(struct mem_cgroup *memcg)
 	return memcg ? memcg->kmemcg_id : -1;
 }
 
+extern int memcg_expand_shrinker_maps(int new_id);
+
 #else
 #define for_each_memcg_cache_index(_idx)	\
 	for (; NULL; )
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74247a580cdd..f81581e26667 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -320,6 +320,120 @@  EXPORT_SYMBOL(memcg_kmem_enabled_key);
 
 struct workqueue_struct *memcg_kmem_cache_wq;
 
+static int memcg_shrinker_map_size;
+static DEFINE_MUTEX(memcg_shrinker_map_mutex);
+
+static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
+{
+	kvfree(container_of(head, struct memcg_shrinker_map, rcu));
+}
+
+static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
+					 int size, int old_size)
+{
+	struct memcg_shrinker_map *new, *old;
+	int nid;
+
+	lockdep_assert_held(&memcg_shrinker_map_mutex);
+
+	for_each_node(nid) {
+		old = rcu_dereference_protected(
+			mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
+		/* Not yet online memcg */
+		if (!old)
+			return 0;
+
+		new = kvmalloc(sizeof(*new) + size, GFP_KERNEL);
+		if (!new)
+			return -ENOMEM;
+
+		/* Set all old bits, clear all new bits */
+		memset(new->map, (int)0xff, old_size);
+		memset((void *)new->map + old_size, 0, size - old_size);
+
+		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
+		if (old)
+			call_rcu(&old->rcu, memcg_free_shrinker_map_rcu);
+	}
+
+	return 0;
+}
+
+static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup_per_node *pn;
+	struct memcg_shrinker_map *map;
+	int nid;
+
+	if (mem_cgroup_is_root(memcg))
+		return;
+
+	for_each_node(nid) {
+		pn = mem_cgroup_nodeinfo(memcg, nid);
+		map = rcu_dereference_protected(pn->shrinker_map, true);
+		if (map)
+			kvfree(map);
+		rcu_assign_pointer(pn->shrinker_map, NULL);
+	}
+}
+
+static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
+{
+	struct memcg_shrinker_map *map;
+	int nid, size, ret = 0;
+
+	if (mem_cgroup_is_root(memcg))
+		return 0;
+
+	mutex_lock(&memcg_shrinker_map_mutex);
+	size = memcg_shrinker_map_size;
+	for_each_node(nid) {
+		map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
+		if (!map) {
+			memcg_free_shrinker_maps(memcg);
+			ret = -ENOMEM;
+			break;
+		}
+		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
+	}
+	mutex_unlock(&memcg_shrinker_map_mutex);
+
+	return ret;
+}
+
+int memcg_expand_shrinker_maps(int new_id)
+{
+	int size, old_size, ret = 0;
+	struct mem_cgroup *memcg;
+
+	size = DIV_ROUND_UP(new_id + 1, BITS_PER_BYTE);
+	old_size = memcg_shrinker_map_size;
+	if (size <= old_size)
+		return 0;
+
+	mutex_lock(&memcg_shrinker_map_mutex);
+	if (!root_mem_cgroup)
+		goto unlock;
+
+	for_each_mem_cgroup(memcg) {
+		if (mem_cgroup_is_root(memcg))
+			continue;
+		ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
+		if (ret)
+			goto unlock;
+	}
+unlock:
+	if (!ret)
+		memcg_shrinker_map_size = size;
+	mutex_unlock(&memcg_shrinker_map_mutex);
+	return ret;
+}
+#else /* CONFIG_MEMCG_KMEM */
+static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+static void memcg_free_shrinker_maps(struct mem_cgroup *memcg) { }
 #endif /* CONFIG_MEMCG_KMEM */
 
 /**
@@ -4541,6 +4655,11 @@  static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
+	if (memcg_alloc_shrinker_maps(memcg)) {
+		mem_cgroup_id_remove(memcg);
+		return -ENOMEM;
+	}
+
 	/* Online state pins memcg ID, memcg ID pins CSS */
 	atomic_set(&memcg->id.ref, 1);
 	css_get(css);
@@ -4593,6 +4712,7 @@  static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	vmpressure_cleanup(&memcg->vmpressure);
 	cancel_work_sync(&memcg->high_work);
 	mem_cgroup_remove_from_trees(memcg);
+	memcg_free_shrinker_maps(memcg);
 	memcg_free_kmem(memcg);
 	mem_cgroup_free(memcg);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f9ca6b57d72f..d8b1958c751d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -182,6 +182,11 @@  static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 	if (id < 0)
 		goto unlock;
 
+	if (memcg_expand_shrinker_maps(id)) {
+		idr_remove(&shrinker_idr, id);
+		goto unlock;
+	}
+
 	if (id >= shrinker_nr_max)
 		shrinker_nr_max = id + 1;
 	shrinker->id = id;