[v2,1/2] mm, slab: Extend slab/shrink to shrink all memcg caches

Message ID 20190717202413.13237-2-longman@redhat.com (mailing list archive)
State New, archived
Series mm, slab: Extend slab/shrink to shrink all memcg caches

Commit Message

Waiman Long July 17, 2019, 8:24 p.m. UTC
Currently, a value of "1" is written to the /sys/kernel/slab/<slab>/shrink
file to shrink the slab by flushing out all the per-cpu slabs and freeing
the empty slabs in partial lists. This can be useful to squeeze out a bit
more memory under extreme conditions, as well as to make the active object
counts in /proc/slabinfo more accurate.
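
For example, a minimal sketch of the interface ("dentry" is just an
illustrative cache name; the write requires root):

 # echo 1 > /sys/kernel/slab/dentry/shrink
 # echo 0 > /sys/kernel/slab/dentry/shrink	# rejected: only "1" is accepted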

This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
option is usually not enabled and "slub_memcg_sysfs=1" not set. Even
if memcg sysfs is turned on, it is too cumbersome and impractical to
manage all those per-memcg sysfs files in a real production system.

So there is no practical way to shrink memcg caches.  Fix this by
making a write to the shrink sysfs file of a root cache also scan
all the available memcg caches and shrink them as well. For a
non-root memcg cache (present when SLUB_MEMCG_SYSFS_ON or
slub_memcg_sysfs is on), writing to its shrink file shrinks only
that cache.
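
A sketch of the resulting behavior (the cache and memcg names are
illustrative; the per-memcg shrink files, under the root cache's "cgroup"
sysfs directory, exist only when memcg sysfs support is enabled):

 # echo 1 > /sys/kernel/slab/task_struct/shrink
   (root cache: task_struct and all its child memcg caches are shrunk)
 # echo 1 > '/sys/kernel/slab/task_struct/cgroup/task_struct(42:example)/shrink'
   (per-memcg cache: only that one cache is shrunk)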

On a 2-socket 64-core 256-thread arm64 system with 64k pages, after
a parallel kernel build, the amount of memory occupied by slabs
before shrinking was:

 # grep task_struct /proc/slabinfo
 task_struct        53137  53192   4288   61    4 : tunables    0    0    0 : slabdata    872    872      0
 # grep "^S[lRU]" /proc/meminfo
 Slab:            3936832 kB
 SReclaimable:     399104 kB
 SUnreclaim:      3537728 kB

After shrinking slabs:

 # grep "^S[lRU]" /proc/meminfo
 Slab:            1356288 kB
 SReclaimable:     263296 kB
 SUnreclaim:      1092992 kB
 # grep task_struct /proc/slabinfo
 task_struct         2764   6832   4288   61    4 : tunables    0    0    0 : slabdata    112    112      0

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
---
 Documentation/ABI/testing/sysfs-kernel-slab | 12 ++++---
 mm/slab.h                                   |  1 +
 mm/slab_common.c                            | 37 +++++++++++++++++++++
 mm/slub.c                                   |  2 +-
 4 files changed, 47 insertions(+), 5 deletions(-)

Comments

Christoph Lameter (Ampere) July 18, 2019, 11:38 a.m. UTC | #1
On Wed, 17 Jul 2019, Waiman Long wrote:

> Currently, a value of "1" is written to the /sys/kernel/slab/<slab>/shrink
> file to shrink the slab by flushing out all the per-cpu slabs and freeing
> the empty slabs in partial lists. This can be useful to squeeze out a bit
> more memory under extreme conditions, as well as to make the active object
> counts in /proc/slabinfo more accurate.

Acked-by: Christoph Lameter <cl@linux.com>

>  # grep task_struct /proc/slabinfo
>  task_struct        53137  53192   4288   61    4 : tunables    0    0    0 : slabdata    872    872      0
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:            3936832 kB
>  SReclaimable:     399104 kB
>  SUnreclaim:      3537728 kB
>
> After shrinking slabs:
>
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:            1356288 kB
>  SReclaimable:     263296 kB
>  SUnreclaim:      1092992 kB

Well, another indicator that it may not be a good decision to replicate
the whole set of slabs for each memcg. Migrating the memcg ownership into
the objects may allow the use of the same slab cache. In particular,
together with the slab migration patches, this may be a viable way to
reduce memory consumption.
Roman Gushchin July 18, 2019, 5:05 p.m. UTC | #2
On Thu, Jul 18, 2019 at 11:38:11AM +0000, Christopher Lameter wrote:
> On Wed, 17 Jul 2019, Waiman Long wrote:
> 
> > Currently, a value of "1" is written to the /sys/kernel/slab/<slab>/shrink
> > file to shrink the slab by flushing out all the per-cpu slabs and freeing
> > the empty slabs in partial lists. This can be useful to squeeze out a bit
> > more memory under extreme conditions, as well as to make the active object
> > counts in /proc/slabinfo more accurate.
> 
> Acked-by: Christoph Lameter <cl@linux.com>
> 
> >  # grep task_struct /proc/slabinfo
> >  task_struct        53137  53192   4288   61    4 : tunables    0    0    0 : slabdata    872    872      0
> >  # grep "^S[lRU]" /proc/meminfo
> >  Slab:            3936832 kB
> >  SReclaimable:     399104 kB
> >  SUnreclaim:      3537728 kB
> >
> > After shrinking slabs:
> >
> >  # grep "^S[lRU]" /proc/meminfo
> >  Slab:            1356288 kB
> >  SReclaimable:     263296 kB
> >  SUnreclaim:      1092992 kB
> 
> Well, another indicator that it may not be a good decision to replicate
> the whole set of slabs for each memcg. Migrating the memcg ownership into
> the objects may allow the use of the same slab cache. In particular,
> together with the slab migration patches, this may be a viable way to
> reduce memory consumption.
> 

Btw I'm working on an alternative solution. It's way too early to present
anything, but preliminary results are looking promising: slab memory usage
is decreased by 10-40% depending on the workload.

Thanks!
Michal Hocko July 19, 2019, 6:20 a.m. UTC | #3
On Wed 17-07-19 16:24:12, Waiman Long wrote:
> Currently, a value of "1" is written to the /sys/kernel/slab/<slab>/shrink
> file to shrink the slab by flushing out all the per-cpu slabs and freeing
> the empty slabs in partial lists. This can be useful to squeeze out a bit
> more memory under extreme conditions, as well as to make the active object
> counts in /proc/slabinfo more accurate.
> 
> This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
> option is usually not enabled and "slub_memcg_sysfs=1" not set. Even
> if memcg sysfs is turned on, it is too cumbersome and impractical to
> manage all those per-memcg sysfs files in a real production system.
> 
> So there is no practical way to shrink memcg caches.  Fix this by
> making a write to the shrink sysfs file of a root cache also scan
> all the available memcg caches and shrink them as well. For a
> non-root memcg cache (present when SLUB_MEMCG_SYSFS_ON or
> slub_memcg_sysfs is on), writing to its shrink file shrinks only
> that cache.

I would mention that memcg unawareness was an oversight more than
anything else. The interface is intended to shrink all pcp data of the
cache. The fact that we are using per-memcg internal caches is an
implementation detail.

> On a 2-socket 64-core 256-thread arm64 system with 64k pages, after
> a parallel kernel build, the amount of memory occupied by slabs
> before shrinking was:
> 
>  # grep task_struct /proc/slabinfo
>  task_struct        53137  53192   4288   61    4 : tunables    0    0    0 : slabdata    872    872      0
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:            3936832 kB
>  SReclaimable:     399104 kB
>  SUnreclaim:      3537728 kB
> 
> After shrinking slabs:
> 
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:            1356288 kB
>  SReclaimable:     263296 kB
>  SUnreclaim:      1092992 kB
>  # grep task_struct /proc/slabinfo
>  task_struct         2764   6832   4288   61    4 : tunables    0    0    0 : slabdata    112    112      0

Now that you are touching the documentation, I would just add a note that
shrinking might be expensive and can block other slab operations, so it
should be used with some care.

> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: Roman Gushchin <guro@fb.com>

The patch looks good to me. I do not feel qualified to give my ack but
it is definitely a change in the right direction.

Let's just be careful about recommending this to people as a workaround
for over-caching and the resulting skewed stats. That needs to be
addressed separately.

Thanks!

> ---
>  Documentation/ABI/testing/sysfs-kernel-slab | 12 ++++---
>  mm/slab.h                                   |  1 +
>  mm/slab_common.c                            | 37 +++++++++++++++++++++
>  mm/slub.c                                   |  2 +-
>  4 files changed, 47 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab
> index 29601d93a1c2..94ffd47fc8d7 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-slab
> +++ b/Documentation/ABI/testing/sysfs-kernel-slab
> @@ -429,10 +429,14 @@ KernelVersion:	2.6.22
>  Contact:	Pekka Enberg <penberg@cs.helsinki.fi>,
>  		Christoph Lameter <cl@linux-foundation.org>
>  Description:
> -		The shrink file is written when memory should be reclaimed from
> -		a cache.  Empty partial slabs are freed and the partial list is
> -		sorted so the slabs with the fewest available objects are used
> -		first.
> +		The shrink file is used to enable some unused slab cache
> +		memory to be reclaimed from a cache.  Empty per-cpu
> +		or partial slabs are freed and the partial list is
> +		sorted so the slabs with the fewest available objects
> +		are used first.  It only accepts a value of "1" on
> +		write for shrinking the cache. Other input values are
> +		considered invalid.  If it is a root cache, all the
> +		child memcg caches will also be shrunk, if available.
>  
>  What:		/sys/kernel/slab/cache/slab_size
>  Date:		May 2007
> diff --git a/mm/slab.h b/mm/slab.h
> index 9057b8056b07..5bf615cb3f99 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -174,6 +174,7 @@ int __kmem_cache_shrink(struct kmem_cache *);
>  void __kmemcg_cache_deactivate(struct kmem_cache *s);
>  void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
>  void slab_kmem_cache_release(struct kmem_cache *);
> +void kmem_cache_shrink_all(struct kmem_cache *s);
>  
>  struct seq_file;
>  struct file;
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 807490fe217a..6491c3a41805 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -981,6 +981,43 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
>  }
>  EXPORT_SYMBOL(kmem_cache_shrink);
>  
> +/**
> + * kmem_cache_shrink_all - shrink a cache and all memcg caches for root cache
> + * @s: The cache pointer
> + */
> +void kmem_cache_shrink_all(struct kmem_cache *s)
> +{
> +	struct kmem_cache *c;
> +
> +	if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !is_root_cache(s)) {
> +		kmem_cache_shrink(s);
> +		return;
> +	}
> +
> +	get_online_cpus();
> +	get_online_mems();
> +	kasan_cache_shrink(s);
> +	__kmem_cache_shrink(s);
> +
> +	/*
> +	 * We have to take the slab_mutex to protect from the memcg list
> +	 * modification.
> +	 */
> +	mutex_lock(&slab_mutex);
> +	for_each_memcg_cache(c, s) {
> +		/*
> +		 * Don't need to shrink deactivated memcg caches.
> +		 */
> +		if (s->flags & SLAB_DEACTIVATED)
> +			continue;
> +		kasan_cache_shrink(c);
> +		__kmem_cache_shrink(c);
> +	}
> +	mutex_unlock(&slab_mutex);
> +	put_online_mems();
> +	put_online_cpus();
> +}
> +
>  bool slab_is_available(void)
>  {
>  	return slab_state >= UP;
> diff --git a/mm/slub.c b/mm/slub.c
> index e6c030e47364..9736eb10dcb8 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5294,7 +5294,7 @@ static ssize_t shrink_store(struct kmem_cache *s,
>  			const char *buf, size_t length)
>  {
>  	if (buf[0] == '1')
> -		kmem_cache_shrink(s);
> +		kmem_cache_shrink_all(s);
>  	else
>  		return -EINVAL;
>  	return length;
> -- 
> 2.18.1
Waiman Long July 19, 2019, 2:09 p.m. UTC | #4
On 7/19/19 2:20 AM, Michal Hocko wrote:
> On Wed 17-07-19 16:24:12, Waiman Long wrote:
>> Currently, a value of "1" is written to the /sys/kernel/slab/<slab>/shrink
>> file to shrink the slab by flushing out all the per-cpu slabs and freeing
>> the empty slabs in partial lists. This can be useful to squeeze out a bit
>> more memory under extreme conditions, as well as to make the active object
>> counts in /proc/slabinfo more accurate.
>>
>> This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
>> option is usually not enabled and "slub_memcg_sysfs=1" not set. Even
>> if memcg sysfs is turned on, it is too cumbersome and impractical to
>> manage all those per-memcg sysfs files in a real production system.
>>
>> So there is no practical way to shrink memcg caches.  Fix this by
>> making a write to the shrink sysfs file of a root cache also scan
>> all the available memcg caches and shrink them as well. For a
>> non-root memcg cache (present when SLUB_MEMCG_SYSFS_ON or
>> slub_memcg_sysfs is on), writing to its shrink file shrinks only
>> that cache.
> I would mention that memcg unawareness was an overlook more than
> anything else. The interface is intended to shrink all pcp data of the
> cache. The fact that we are using per-memcg internal caches is an
> implementation detail.
>
>> On a 2-socket 64-core 256-thread arm64 system with 64k pages, after
>> a parallel kernel build, the amount of memory occupied by slabs
>> before shrinking was:
>>
>>  # grep task_struct /proc/slabinfo
>>  task_struct        53137  53192   4288   61    4 : tunables    0    0    0 : slabdata    872    872      0
>>  # grep "^S[lRU]" /proc/meminfo
>>  Slab:            3936832 kB
>>  SReclaimable:     399104 kB
>>  SUnreclaim:      3537728 kB
>>
>> After shrinking slabs:
>>
>>  # grep "^S[lRU]" /proc/meminfo
>>  Slab:            1356288 kB
>>  SReclaimable:     263296 kB
>>  SUnreclaim:      1092992 kB
>>  # grep task_struct /proc/slabinfo
>>  task_struct         2764   6832   4288   61    4 : tunables    0    0    0 : slabdata    112    112      0
> Now that you are touching the documentation I would just add a note that
> shrinking might be expensive and block other slab operations so it
> should be used with some care.
>
Good point. I will update the patch to include such a note in the
documentation.

Thanks,
Longman

Patch

diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab
index 29601d93a1c2..94ffd47fc8d7 100644
--- a/Documentation/ABI/testing/sysfs-kernel-slab
+++ b/Documentation/ABI/testing/sysfs-kernel-slab
@@ -429,10 +429,14 @@  KernelVersion:	2.6.22
 Contact:	Pekka Enberg <penberg@cs.helsinki.fi>,
 		Christoph Lameter <cl@linux-foundation.org>
 Description:
-		The shrink file is written when memory should be reclaimed from
-		a cache.  Empty partial slabs are freed and the partial list is
-		sorted so the slabs with the fewest available objects are used
-		first.
+		The shrink file is used to enable some unused slab cache
+		memory to be reclaimed from a cache.  Empty per-cpu
+		or partial slabs are freed and the partial list is
+		sorted so the slabs with the fewest available objects
+		are used first.  It only accepts a value of "1" on
+		write for shrinking the cache. Other input values are
+		considered invalid.  If it is a root cache, all the
+		child memcg caches will also be shrunk, if available.
 
 What:		/sys/kernel/slab/cache/slab_size
 Date:		May 2007
diff --git a/mm/slab.h b/mm/slab.h
index 9057b8056b07..5bf615cb3f99 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -174,6 +174,7 @@  int __kmem_cache_shrink(struct kmem_cache *);
 void __kmemcg_cache_deactivate(struct kmem_cache *s);
 void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
 void slab_kmem_cache_release(struct kmem_cache *);
+void kmem_cache_shrink_all(struct kmem_cache *s);
 
 struct seq_file;
 struct file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 807490fe217a..6491c3a41805 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -981,6 +981,43 @@  int kmem_cache_shrink(struct kmem_cache *cachep)
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
+/**
+ * kmem_cache_shrink_all - shrink a cache and all memcg caches for root cache
+ * @s: The cache pointer
+ */
+void kmem_cache_shrink_all(struct kmem_cache *s)
+{
+	struct kmem_cache *c;
+
+	if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || !is_root_cache(s)) {
+		kmem_cache_shrink(s);
+		return;
+	}
+
+	get_online_cpus();
+	get_online_mems();
+	kasan_cache_shrink(s);
+	__kmem_cache_shrink(s);
+
+	/*
+	 * We have to take the slab_mutex to protect from the memcg list
+	 * modification.
+	 */
+	mutex_lock(&slab_mutex);
+	for_each_memcg_cache(c, s) {
+		/*
+		 * Don't need to shrink deactivated memcg caches.
+		 */
+		if (s->flags & SLAB_DEACTIVATED)
+			continue;
+		kasan_cache_shrink(c);
+		__kmem_cache_shrink(c);
+	}
+	mutex_unlock(&slab_mutex);
+	put_online_mems();
+	put_online_cpus();
+}
+
 bool slab_is_available(void)
 {
 	return slab_state >= UP;
diff --git a/mm/slub.c b/mm/slub.c
index e6c030e47364..9736eb10dcb8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5294,7 +5294,7 @@  static ssize_t shrink_store(struct kmem_cache *s,
 			const char *buf, size_t length)
 {
 	if (buf[0] == '1')
-		kmem_cache_shrink(s);
+		kmem_cache_shrink_all(s);
 	else
 		return -EINVAL;
 	return length;