| Message ID | 153063056619.1818.12550500883688681076.stgit@localhost.localdomain (mailing list archive) |
|---|---|
| State | New, archived |
On Tue, 03 Jul 2018 18:09:26 +0300 Kirill Tkhai <ktkhai@virtuozzo.com> wrote:

> Imagine a big node with many cpus, memory cgroups and containers.
> Let we have 200 containers, every container has 10 mounts,
> and 10 cgroups. All container tasks don't touch foreign
> containers mounts. If there is intensive pages write,
> and global reclaim happens, a writing task has to iterate
> over all memcgs to shrink slab, before it's able to go
> to shrink_page_list().
>
> Iteration over all the memcg slabs is very expensive:
> the task has to visit 200 * 10 = 2000 shrinkers
> for every memcg, and since there are 2000 memcgs,
> the total calls are 2000 * 2000 = 4000000.
>
> So, the shrinker makes 4 million do_shrink_slab() calls
> just to try to isolate SWAP_CLUSTER_MAX pages in one
> of the actively writing memcg via shrink_page_list().
> I've observed a node spending almost 100% in kernel,
> making useless iteration over already shrinked slab.
>
> This patch adds bitmap of memcg-aware shrinkers to memcg.
> The size of the bitmap depends on bitmap_nr_ids, and during
> memcg life it's maintained to be enough to fit bitmap_nr_ids
> shrinkers. Every bit in the map is related to corresponding
> shrinker id.
>
> Next patches will maintain set bit only for really charged
> memcg. This will allow shrink_slab() to increase its
> performance in significant way. See the last patch for
> the numbers.
>
> ...
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -182,6 +182,11 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>  	if (id < 0)
>  		goto unlock;
>  
> +	if (memcg_expand_shrinker_maps(id)) {
> +		idr_remove(&shrinker_idr, id);
> +		goto unlock;
> +	}
> +
>  	if (id >= shrinker_nr_max)
>  		shrinker_nr_max = id + 1;
>  	shrinker->id = id;

This function ends up being a rather sad little thing.

: static int prealloc_memcg_shrinker(struct shrinker *shrinker)
: {
: 	int id, ret = -ENOMEM;
:
: 	down_write(&shrinker_rwsem);
: 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
: 	if (id < 0)
: 		goto unlock;
:
: 	if (memcg_expand_shrinker_maps(id)) {
: 		idr_remove(&shrinker_idr, id);
: 		goto unlock;
: 	}
:
: 	if (id >= shrinker_nr_max)
: 		shrinker_nr_max = id + 1;
: 	shrinker->id = id;
: 	ret = 0;
: unlock:
: 	up_write(&shrinker_rwsem);
: 	return ret;
: }

- there's no need to call memcg_expand_shrinker_maps() unless id >=
  shrinker_nr_max so why not move the code and avoid calling
  memcg_expand_shrinker_maps() in most cases.

- why aren't we decreasing shrinker_nr_max in
  unregister_memcg_shrinker()?  That's easy to do, avoids pointless
  work in shrink_slab_memcg() and avoids memory waste in future
  prealloc_memcg_shrinker() calls.

It should be possible to find the highest ID in an IDR tree with a
straightforward descent of the underlying radix tree, but I doubt if
that has been wired up.  Otherwise a simple loop in
unregister_memcg_shrinker() would be needed.
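For illustration, a minimal sketch of the reordering suggested above — calling memcg_expand_shrinker_maps() only when the freshly allocated id does not fit below shrinker_nr_max — using the identifiers from the quoted patch. This is a sketch of the suggestion, not code from this thread or the version that was eventually applied.

/*
 * Sketch only: prealloc_memcg_shrinker() with the expansion call moved
 * under the "id does not fit yet" branch, so the common case skips it.
 */
static int prealloc_memcg_shrinker(struct shrinker *shrinker)
{
	int id, ret = -ENOMEM;

	down_write(&shrinker_rwsem);
	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
	if (id < 0)
		goto unlock;

	if (id >= shrinker_nr_max) {
		/* Only now do the per-memcg maps possibly need to grow. */
		if (memcg_expand_shrinker_maps(id)) {
			idr_remove(&shrinker_idr, id);
			goto unlock;
		}
		shrinker_nr_max = id + 1;
	}

	shrinker->id = id;
	ret = 0;
unlock:
	up_write(&shrinker_rwsem);
	return ret;
}

Whether this reordering buys anything is discussed further down the thread: memcg_expand_shrinker_maps() already returns early when the maps are big enough.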
On 03.07.2018 23:50, Andrew Morton wrote:
> On Tue, 03 Jul 2018 18:09:26 +0300 Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
>> Imagine a big node with many cpus, memory cgroups and containers.
>> Let we have 200 containers, every container has 10 mounts,
>> and 10 cgroups. All container tasks don't touch foreign
>> containers mounts. If there is intensive pages write,
>> and global reclaim happens, a writing task has to iterate
>> over all memcgs to shrink slab, before it's able to go
>> to shrink_page_list().
>>
>> Iteration over all the memcg slabs is very expensive:
>> the task has to visit 200 * 10 = 2000 shrinkers
>> for every memcg, and since there are 2000 memcgs,
>> the total calls are 2000 * 2000 = 4000000.
>>
>> So, the shrinker makes 4 million do_shrink_slab() calls
>> just to try to isolate SWAP_CLUSTER_MAX pages in one
>> of the actively writing memcg via shrink_page_list().
>> I've observed a node spending almost 100% in kernel,
>> making useless iteration over already shrinked slab.
>>
>> This patch adds bitmap of memcg-aware shrinkers to memcg.
>> The size of the bitmap depends on bitmap_nr_ids, and during
>> memcg life it's maintained to be enough to fit bitmap_nr_ids
>> shrinkers. Every bit in the map is related to corresponding
>> shrinker id.
>>
>> Next patches will maintain set bit only for really charged
>> memcg. This will allow shrink_slab() to increase its
>> performance in significant way. See the last patch for
>> the numbers.
>>
>> ...
>>
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -182,6 +182,11 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>>  	if (id < 0)
>>  		goto unlock;
>>  
>> +	if (memcg_expand_shrinker_maps(id)) {
>> +		idr_remove(&shrinker_idr, id);
>> +		goto unlock;
>> +	}
>> +
>>  	if (id >= shrinker_nr_max)
>>  		shrinker_nr_max = id + 1;
>>  	shrinker->id = id;
>
> This function ends up being a rather sad little thing.
>
> : static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> : {
> : 	int id, ret = -ENOMEM;
> :
> : 	down_write(&shrinker_rwsem);
> : 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> : 	if (id < 0)
> : 		goto unlock;
> :
> : 	if (memcg_expand_shrinker_maps(id)) {
> : 		idr_remove(&shrinker_idr, id);
> : 		goto unlock;
> : 	}
> :
> : 	if (id >= shrinker_nr_max)
> : 		shrinker_nr_max = id + 1;
> : 	shrinker->id = id;
> : 	ret = 0;
> : unlock:
> : 	up_write(&shrinker_rwsem);
> : 	return ret;
> : }
>
> - there's no need to call memcg_expand_shrinker_maps() unless id >=
>   shrinker_nr_max so why not move the code and avoid calling
>   memcg_expand_shrinker_maps() in most cases.

OK

> - why aren't we decreasing shrinker_nr_max in
>   unregister_memcg_shrinker()?  That's easy to do, avoids pointless
>   work in shrink_slab_memcg() and avoids memory waste in future
>   prealloc_memcg_shrinker() calls.

You sure, but there are some things. Initially I went in the same way
as memcg_nr_cache_ids is made and just took the same x2 arithmetic.
It never decreases, so it looked good to make shrinker maps like it.
It's the only reason, so, it should not be a problem to rework.

The only moment is Vladimir strongly recommends modularity, i.e.
to have memcg_shrinker_map_size and shrinker_nr_max as different variables.
After the rework we won't be able to have this anymore, since memcontrol.c
will have to know actual shrinker_nr_max value and it will have to be exported.

Could this be a problem?

> It should be possible to find the highest ID in an IDR tree with a
> straightforward descent of the underlying radix tree, but I doubt if
> that has been wired up.  Otherwise a simple loop in
> unregister_memcg_shrinker() would be needed.

Kirill
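For readers unfamiliar with the precedent Kirill refers to, the "x2 arithmetic" is the never-shrinking growth used for memcg_nr_cache_ids. The sketch below is a paraphrase loosely based on memcg_alloc_cache_id() in mm/memcontrol.c of that era (locking elided, error paths simplified); treat the details as approximate rather than verbatim kernel source.

/*
 * Paraphrased sketch: when a new kmemcg id does not fit, the id space
 * is doubled (clamped to a fixed range) and is never reduced again.
 */
static int memcg_alloc_cache_id(void)
{
	int id, size;

	id = ida_simple_get(&memcg_cache_ida, 0,
			    MEMCG_CACHES_MAX_SIZE, GFP_KERNEL);
	if (id < 0)
		return id;

	/* Common case: existing arrays are already big enough. */
	if (id < memcg_nr_cache_ids)
		return id;

	/* Otherwise grow by doubling; the size never goes back down. */
	size = clamp(2 * (id + 1), MEMCG_CACHES_MIN_SIZE, MEMCG_CACHES_MAX_SIZE);
	if (memcg_update_all_caches(size) || memcg_update_all_list_lrus(size)) {
		ida_simple_remove(&memcg_cache_ida, id);
		return -ENOMEM;
	}
	memcg_nr_cache_ids = size;
	return id;
}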
On Wed, 4 Jul 2018 18:51:12 +0300 Kirill Tkhai <ktkhai@virtuozzo.com> wrote:

> > - why aren't we decreasing shrinker_nr_max in
> >   unregister_memcg_shrinker()?  That's easy to do, avoids pointless
> >   work in shrink_slab_memcg() and avoids memory waste in future
> >   prealloc_memcg_shrinker() calls.
>
> You sure, but there are some things. Initially I went in the same way
> as memcg_nr_cache_ids is made and just took the same x2 arithmetic.
> It never decreases, so it looked good to make shrinker maps like it.
> It's the only reason, so, it should not be a problem to rework.
>
> The only moment is Vladimir strongly recommends modularity, i.e.
> to have memcg_shrinker_map_size and shrinker_nr_max as different variables.

For what reasons?

> After the rework we won't be able to have this anymore, since memcontrol.c
> will have to know actual shrinker_nr_max value and it will have to be exported.
>
> Could this be a problem?

Vladimir?
On Tue, Jul 03, 2018 at 01:50:00PM -0700, Andrew Morton wrote:
> It should be possible to find the highest ID in an IDR tree with a
> straightforward descent of the underlying radix tree, but I doubt if
> that has been wired up.  Otherwise a simple loop in
> unregister_memcg_shrinker() would be needed.

Feature request received.  I've actually implemented it for the XArray
already, but it should be easy to do for the IDR too.
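Until such an IDR helper exists, the "simple loop" fallback Andrew mentions could look roughly like the sketch below: on unregister, walk the remaining entries and recompute shrinker_nr_max. This is an illustration only, reusing the shrinker_idr / shrinker_rwsem / shrinker_nr_max names from the series; it is not a patch from this thread.

/*
 * Sketch only: shrink shrinker_nr_max back down when a memcg-aware
 * shrinker goes away, by scanning what is left in the IDR.
 */
static void unregister_memcg_shrinker(struct shrinker *shrinker)
{
	struct shrinker *iter;
	int id, max_id = -1;

	down_write(&shrinker_rwsem);
	idr_remove(&shrinker_idr, shrinker->id);

	/* IDR iteration is in ascending id order, so the last hit wins. */
	idr_for_each_entry(&shrinker_idr, iter, id)
		max_id = id;

	shrinker_nr_max = max_id + 1;
	up_write(&shrinker_rwsem);
}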
On Tue, Jul 03, 2018 at 01:50:00PM -0700, Andrew Morton wrote:
> On Tue, 03 Jul 2018 18:09:26 +0300 Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> > Imagine a big node with many cpus, memory cgroups and containers.
> > Let we have 200 containers, every container has 10 mounts,
> > and 10 cgroups. All container tasks don't touch foreign
> > containers mounts. If there is intensive pages write,
> > and global reclaim happens, a writing task has to iterate
> > over all memcgs to shrink slab, before it's able to go
> > to shrink_page_list().
> >
> > Iteration over all the memcg slabs is very expensive:
> > the task has to visit 200 * 10 = 2000 shrinkers
> > for every memcg, and since there are 2000 memcgs,
> > the total calls are 2000 * 2000 = 4000000.
> >
> > So, the shrinker makes 4 million do_shrink_slab() calls
> > just to try to isolate SWAP_CLUSTER_MAX pages in one
> > of the actively writing memcg via shrink_page_list().
> > I've observed a node spending almost 100% in kernel,
> > making useless iteration over already shrinked slab.
> >
> > This patch adds bitmap of memcg-aware shrinkers to memcg.
> > The size of the bitmap depends on bitmap_nr_ids, and during
> > memcg life it's maintained to be enough to fit bitmap_nr_ids
> > shrinkers. Every bit in the map is related to corresponding
> > shrinker id.
> >
> > Next patches will maintain set bit only for really charged
> > memcg. This will allow shrink_slab() to increase its
> > performance in significant way. See the last patch for
> > the numbers.
> >
> > ...
> >
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -182,6 +182,11 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> >  	if (id < 0)
> >  		goto unlock;
> >  
> > +	if (memcg_expand_shrinker_maps(id)) {
> > +		idr_remove(&shrinker_idr, id);
> > +		goto unlock;
> > +	}
> > +
> >  	if (id >= shrinker_nr_max)
> >  		shrinker_nr_max = id + 1;
> >  	shrinker->id = id;
>
> This function ends up being a rather sad little thing.
>
> : static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> : {
> : 	int id, ret = -ENOMEM;
> :
> : 	down_write(&shrinker_rwsem);
> : 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> : 	if (id < 0)
> : 		goto unlock;
> :
> : 	if (memcg_expand_shrinker_maps(id)) {
> : 		idr_remove(&shrinker_idr, id);
> : 		goto unlock;
> : 	}
> :
> : 	if (id >= shrinker_nr_max)
> : 		shrinker_nr_max = id + 1;
> : 	shrinker->id = id;
> : 	ret = 0;
> : unlock:
> : 	up_write(&shrinker_rwsem);
> : 	return ret;
> : }
>
> - there's no need to call memcg_expand_shrinker_maps() unless id >=
>   shrinker_nr_max so why not move the code and avoid calling
>   memcg_expand_shrinker_maps() in most cases.

memcg_expand_shrinker_maps will return immediately if per memcg shrinker
maps can accommodate the new id. Since prealloc_memcg_shrinker is
definitely not a hot path, I don't see any penalty in calling this
function on each prealloc_memcg_shrinker invocation.

> - why aren't we decreasing shrinker_nr_max in
>   unregister_memcg_shrinker()?  That's easy to do, avoids pointless
>   work in shrink_slab_memcg() and avoids memory waste in future
>   prealloc_memcg_shrinker() calls.

We can shrink the maps, but IMHO it isn't worth the complexity it would
introduce, because in my experience if a workload used N mount points
(containers, whatever) at some point of its lifetime, it is likely to
use the same amount in the future.

> It should be possible to find the highest ID in an IDR tree with a
> straightforward descent of the underlying radix tree, but I doubt if
> that has been wired up.  Otherwise a simple loop in
> unregister_memcg_shrinker() would be needed.
On Thu, Jul 05, 2018 at 03:10:30PM -0700, Andrew Morton wrote:
> On Wed, 4 Jul 2018 18:51:12 +0300 Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> > > - why aren't we decreasing shrinker_nr_max in
> > >   unregister_memcg_shrinker()?  That's easy to do, avoids pointless
> > >   work in shrink_slab_memcg() and avoids memory waste in future
> > >   prealloc_memcg_shrinker() calls.
> >
> > You sure, but there are some things. Initially I went in the same way
> > as memcg_nr_cache_ids is made and just took the same x2 arithmetic.
> > It never decreases, so it looked good to make shrinker maps like it.
> > It's the only reason, so, it should not be a problem to rework.
> >
> > The only moment is Vladimir strongly recommends modularity, i.e.
> > to have memcg_shrinker_map_size and shrinker_nr_max as different variables.
>
> For what reasons?

Having the only global variable updated in vmscan.c and used in
memcontrol.c or vice versa didn't look good to me. So I suggested to
introduce two separate static variables: one for max shrinker id (local
to vmscan.c) and another for max allocated size of per memcg shrinker
maps (local to memcontrol.c). Having the two variables instead of one
allows us to define a clear API between memcontrol.c and shrinker
infrastructure without sharing variables, which makes the code easier
to follow IMHO.

It is also more flexible: with the two variables we can decrease
shrinker_id_max when a shrinker is destroyed so that we can speed up
shrink_slab - this is fairly easy to do - but leave per memcg shrinker
maps the same size, because shrinking them is rather complicated and
doesn't seem to be worthwhile - the workload is likely to use the same
amount of memory again in the future.

> > After the rework we won't be able to have this anymore, since memcontrol.c
> > will have to know actual shrinker_nr_max value and it will have to be exported.

Not necessarily. You can pass max shrinker id instead of the new id to
memcontrol.c in function arguments. But as I said before, I really
don't think that shrinking per memcg maps would make much sense.

> > Could this be a problem?
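Concretely, the split Vladimir argues for keeps one counter on each side of the boundary and lets a single function cross it. The schematic below only restates the layering in the patch at the end of this page; the comments are illustrative, not new code from the thread.

/* mm/vmscan.c: the shrinker infrastructure tracks ids only. */
static DEFINE_IDR(shrinker_idr);
static int shrinker_nr_max;		/* highest allocated shrinker id + 1 */

/* mm/memcontrol.c: the memcg side tracks only how big its maps are. */
static int memcg_shrinker_map_size;	/* bytes per per-node shrinker map */

/*
 * The single crossing point: vmscan.c asks memcontrol.c to guarantee
 * room for bit 'new_id'; neither file reads the other's counter.
 */
int memcg_expand_shrinker_maps(int new_id);

With that boundary in place, shrinker_nr_max could later be made to shrink on unregister without memcontrol.c ever noticing, which is exactly the flexibility argued for above.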
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 62309f180ee6..d8c38eafa251 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -113,6 +113,15 @@ struct lruvec_stat {
 	long count[NR_VM_NODE_STAT_ITEMS];
 };
 
+/*
+ * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
+ * which have elements charged to this memcg.
+ */
+struct memcg_shrinker_map {
+	struct rcu_head rcu;
+	unsigned long map[0];
+};
+
 /*
  * per-zone information in memory controller.
  */
@@ -126,6 +135,9 @@ struct mem_cgroup_per_node {
 
 	struct mem_cgroup_reclaim_iter	iter[DEF_PRIORITY + 1];
 
+#ifdef CONFIG_MEMCG_KMEM
+	struct memcg_shrinker_map __rcu	*shrinker_map;
+#endif
 	struct rb_node		tree_node;	/* RB tree node */
 	unsigned long		usage_in_excess;/* Set to the value by which */
 						/* the soft limit is exceeded*/
@@ -1284,6 +1296,8 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
 	return memcg ? memcg->kmemcg_id : -1;
 }
 
+extern int memcg_expand_shrinker_maps(int new_id);
+
 #else
 #define for_each_memcg_cache_index(_idx)	\
 	for (; NULL; )
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74247a580cdd..f81581e26667 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -320,6 +320,120 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);
 
 struct workqueue_struct *memcg_kmem_cache_wq;
 
+static int memcg_shrinker_map_size;
+static DEFINE_MUTEX(memcg_shrinker_map_mutex);
+
+static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
+{
+	kvfree(container_of(head, struct memcg_shrinker_map, rcu));
+}
+
+static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
+					 int size, int old_size)
+{
+	struct memcg_shrinker_map *new, *old;
+	int nid;
+
+	lockdep_assert_held(&memcg_shrinker_map_mutex);
+
+	for_each_node(nid) {
+		old = rcu_dereference_protected(
+			mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
+		/* Not yet online memcg */
+		if (!old)
+			return 0;
+
+		new = kvmalloc(sizeof(*new) + size, GFP_KERNEL);
+		if (!new)
+			return -ENOMEM;
+
+		/* Set all old bits, clear all new bits */
+		memset(new->map, (int)0xff, old_size);
+		memset((void *)new->map + old_size, 0, size - old_size);
+
+		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
+		if (old)
+			call_rcu(&old->rcu, memcg_free_shrinker_map_rcu);
+	}
+
+	return 0;
+}
+
+static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup_per_node *pn;
+	struct memcg_shrinker_map *map;
+	int nid;
+
+	if (mem_cgroup_is_root(memcg))
+		return;
+
+	for_each_node(nid) {
+		pn = mem_cgroup_nodeinfo(memcg, nid);
+		map = rcu_dereference_protected(pn->shrinker_map, true);
+		if (map)
+			kvfree(map);
+		rcu_assign_pointer(pn->shrinker_map, NULL);
+	}
+}
+
+static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
+{
+	struct memcg_shrinker_map *map;
+	int nid, size, ret = 0;
+
+	if (mem_cgroup_is_root(memcg))
+		return 0;
+
+	mutex_lock(&memcg_shrinker_map_mutex);
+	size = memcg_shrinker_map_size;
+	for_each_node(nid) {
+		map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
+		if (!map) {
+			memcg_free_shrinker_maps(memcg);
+			ret = -ENOMEM;
+			break;
+		}
+		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
+	}
+	mutex_unlock(&memcg_shrinker_map_mutex);
+
+	return ret;
+}
+
+int memcg_expand_shrinker_maps(int new_id)
+{
+	int size, old_size, ret = 0;
+	struct mem_cgroup *memcg;
+
+	size = DIV_ROUND_UP(new_id + 1, BITS_PER_BYTE);
+	old_size = memcg_shrinker_map_size;
+	if (size <= old_size)
+		return 0;
+
+	mutex_lock(&memcg_shrinker_map_mutex);
+	if (!root_mem_cgroup)
+		goto unlock;
+
+	for_each_mem_cgroup(memcg) {
+		if (mem_cgroup_is_root(memcg))
+			continue;
+		ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
+		if (ret)
+			goto unlock;
+	}
+unlock:
+	if (!ret)
+		memcg_shrinker_map_size = size;
+	mutex_unlock(&memcg_shrinker_map_mutex);
+	return ret;
+}
+#else /* CONFIG_MEMCG_KMEM */
+static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+static void memcg_free_shrinker_maps(struct mem_cgroup *memcg) { }
 #endif /* CONFIG_MEMCG_KMEM */
 
 /**
@@ -4541,6 +4655,11 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
+	if (memcg_alloc_shrinker_maps(memcg)) {
+		mem_cgroup_id_remove(memcg);
+		return -ENOMEM;
+	}
+
 	/* Online state pins memcg ID, memcg ID pins CSS */
 	atomic_set(&memcg->id.ref, 1);
 	css_get(css);
@@ -4593,6 +4712,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	vmpressure_cleanup(&memcg->vmpressure);
 	cancel_work_sync(&memcg->high_work);
 	mem_cgroup_remove_from_trees(memcg);
+	memcg_free_shrinker_maps(memcg);
 	memcg_free_kmem(memcg);
 	mem_cgroup_free(memcg);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f9ca6b57d72f..d8b1958c751d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -182,6 +182,11 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 	if (id < 0)
 		goto unlock;
 
+	if (memcg_expand_shrinker_maps(id)) {
+		idr_remove(&shrinker_idr, id);
+		goto unlock;
+	}
+
 	if (id >= shrinker_nr_max)
 		shrinker_nr_max = id + 1;
 	shrinker->id = id;
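To show where the series is heading ("Next patches will maintain set bit only for really charged memcg"), here is a hedged sketch of how a later patch might consume the map in shrink_slab(): walk only the bits set for this memcg and node, and look each shrinker up by id. The names follow the series, but this is an illustration — the actual follow-up patches may differ in detail (clearing of stale bits and error handling are omitted).

/*
 * Sketch only: a memcg-aware shrink_slab() pass driven by the bitmap
 * added above, instead of iterating every registered shrinker.
 */
static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
				       struct mem_cgroup *memcg, int priority)
{
	struct memcg_shrinker_map *map;
	unsigned long freed = 0;
	int i;

	if (!down_read_trylock(&shrinker_rwsem))
		return 0;

	map = rcu_dereference_protected(
		mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
	if (!map)
		goto unlock;

	/* Visit only the shrinkers that have objects charged to this memcg. */
	for_each_set_bit(i, map->map, shrinker_nr_max) {
		struct shrink_control sc = {
			.gfp_mask = gfp_mask,
			.nid = nid,
			.memcg = memcg,
		};
		struct shrinker *shrinker = idr_find(&shrinker_idr, i);

		if (shrinker)
			freed += do_shrink_slab(&sc, shrinker, priority);
	}
unlock:
	up_read(&shrinker_rwsem);
	return freed;
}

With the scenario from the changelog (2000 memcgs x 2000 shrinkers), this turns the 4 million do_shrink_slab() calls into calls only for the shrinkers whose bit is actually set for the memcg being reclaimed.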