
mm,slab,vmscan: accumulate gradual pressure on small slabs

Message ID 20190128143535.7767c397@imladris.surriel.com
State New, archived
Series mm,slab,vmscan: accumulate gradual pressure on small slabs

Commit Message

Rik van Riel Jan. 28, 2019, 7:35 p.m. UTC
There are a few issues with the way the number of slab objects to
scan is calculated in do_shrink_slab.  First, for zero-seek slabs,
we could leave the last object around forever. That could result
in pinning a dying cgroup into memory, instead of reclaiming it.
The fix for that is trivial.

Secondly, small slabs receive much more pressure, relative to their
size, than larger slabs, due to "rounding up" the minimum number of
scanned objects to batch_size.

We can keep the pressure on all slabs equal relative to their size
by accumulating the scan pressure on small slabs over time, resulting
in sometimes scanning an object, instead of always scanning several.

This results in lower system CPU use, and a lower major fault rate,
as actively used entries from smaller caches get reclaimed less
aggressively, and need to be reloaded/recreated less often.
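
To illustrate the accumulation, here is a rough standalone model of the
arithmetic (not the patched kernel code itself), assuming DEF_PRIORITY of
12, the default shrinker->seeks of 2, and a hypothetical cache with 100
freeable objects:

#include <stdio.h>

int main(void)
{
	unsigned long small_scan = 0;
	unsigned long freeable = 100;	/* hypothetical small cache */
	int priority = 12;		/* DEF_PRIORITY */
	int seeks = 2;			/* DEFAULT_SEEKS */
	int run;

	for (run = 1; run <= 128; run++) {
		unsigned long delta;

		/* same arithmetic as the accumulation branch */
		small_scan += freeable;
		delta = small_scan >> priority;
		small_scan -= delta << priority;
		delta = delta * 4 / seeks;

		if (delta)
			printf("run %3d: scan %lu object(s), carry %lu\n",
			       run, delta, small_scan);
	}
	return 0;
}

With the old formula, delta was rounded up to min(freeable, batch_size),
so this whole cache would be scanned on every shrinker run; with the
accumulation it sees a couple of objects roughly once every 41 runs at
default priority.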

Fixes: 4b85afbdacd2 ("mm: zero-seek shrinkers")
Fixes: 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Chris Mason <clm@fb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: kernel-team@fb.com
Tested-by: Chris Mason <clm@fb.com>
---
 include/linux/shrinker.h |  1 +
 mm/vmscan.c              | 16 +++++++++++++---
 2 files changed, 14 insertions(+), 3 deletions(-)

Comments

Roman Gushchin Jan. 28, 2019, 7:45 p.m. UTC | #1
On Mon, Jan 28, 2019 at 02:35:35PM -0500, Rik van Riel wrote:
> There are a few issues with the way the number of slab objects to
> scan is calculated in do_shrink_slab.  First, for zero-seek slabs,
> we could leave the last object around forever. That could result
> in pinning a dying cgroup into memory, instead of reclaiming it.
> The fix for that is trivial.
> 
> Secondly, small slabs receive much more pressure, relative to their
> size, than larger slabs, due to "rounding up" the minimum number of
> scanned objects to batch_size.
> 
> We can keep the pressure on all slabs equal relative to their size
> by accumulating the scan pressure on small slabs over time, resulting
> in sometimes scanning an object, instead of always scanning several.
> 
> This results in lower system CPU use, and a lower major fault rate,
> as actively used entries from smaller caches get reclaimed less
> aggressively, and need to be reloaded/recreated less often.
> 
> Fixes: 4b85afbdacd2 ("mm: zero-seek shrinkers")
> Fixes: 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Chris Mason <clm@fb.com>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: kernel-team@fb.com
> Tested-by: Chris Mason <clm@fb.com>

Hi, Rik!

There are a couple of formatting issues (see below), but other than that
the patch looks very good to me. Thanks!

Acked-by: Roman Gushchin <guro@fb.com>

> ---
>  include/linux/shrinker.h |  1 +
>  mm/vmscan.c              | 16 +++++++++++++---
>  2 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 9443cafd1969..7a9a1a0f935c 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -65,6 +65,7 @@ struct shrinker {
>  
>  	long batch;	/* reclaim batch size, 0 = default */
>  	int seeks;	/* seeks to recreate an obj */
> +	int small_scan;	/* accumulate pressure on slabs with few objects */
>  	unsigned flags;
>  
>  	/* These are for internal use */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a714c4f800e9..0e375bd7a8b6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -488,18 +488,28 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		 * them aggressively under memory pressure to keep
>  		 * them from causing refetches in the IO caches.
>  		 */
> -		delta = freeable / 2;
> +		delta = (freeable + 1)/ 2;
                                      ^
                                      A space is missing here.
>  	}
>  
>  	/*
>  	 * Make sure we apply some minimal pressure on default priority
> -	 * even on small cgroups. Stale objects are not only consuming memory
> +	 * even on small cgroups, by accumulating pressure across multiple
> +	 * slab shrinker runs. Stale objects are not only consuming memory
>  	 * by themselves, but can also hold a reference to a dying cgroup,
>  	 * preventing it from being reclaimed. A dying cgroup with all
>  	 * corresponding structures like per-cpu stats and kmem caches
>  	 * can be really big, so it may lead to a significant waste of memory.
>  	 */
> -	delta = max_t(unsigned long long, delta, min(freeable, batch_size));
> +	if (!delta) {
> +		shrinker->small_scan += freeable;
> +
> +		delta = shrinker->small_scan >> priority;
> +		shrinker->small_scan -= delta << priority;
> +
> +		delta *= 4;
> +		do_div(delta, shrinker->seeks);
> +

This empty line can be removed, I believe.

> +	}
>  
>  	total_scan += delta;
>  	if (total_scan < 0) {
>
Andrew Morton Jan. 28, 2019, 7:54 p.m. UTC | #2
On Mon, 28 Jan 2019 14:35:35 -0500 Rik van Riel <riel@surriel.com> wrote:

> There are a few issues with the way the number of slab objects to
> scan is calculated in do_shrink_slab.  First, for zero-seek slabs,
> we could leave the last object around forever. That could result
> in pinning a dying cgroup into memory, instead of reclaiming it.
> The fix for that is trivial.
> 
> Secondly, small slabs receive much more pressure, relative to their
> size, than larger slabs, due to "rounding up" the minimum number of
> scanned objects to batch_size.
> 
> We can keep the pressure on all slabs equal relative to their size
> by accumulating the scan pressure on small slabs over time, resulting
> in sometimes scanning an object, instead of always scanning several.
> 
> This results in lower system CPU use, and a lower major fault rate,
> as actively used entries from smaller caches get reclaimed less
> aggressively, and need to be reloaded/recreated less often.
> 
> Fixes: 4b85afbdacd2 ("mm: zero-seek shrinkers")
> Fixes: 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Chris Mason <clm@fb.com>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: kernel-team@fb.com
> Tested-by: Chris Mason <clm@fb.com>

I added your Signed-off-by:

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -488,18 +488,28 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		 * them aggressively under memory pressure to keep
>  		 * them from causing refetches in the IO caches.
>  		 */
> -		delta = freeable / 2;
> +		delta = (freeable + 1)/ 2;
>  	}
>  
>  	/*
>  	 * Make sure we apply some minimal pressure on default priority
> -	 * even on small cgroups. Stale objects are not only consuming memory
> +	 * even on small cgroups, by accumulating pressure across multiple
> +	 * slab shrinker runs. Stale objects are not only consuming memory
>  	 * by themselves, but can also hold a reference to a dying cgroup,
>  	 * preventing it from being reclaimed. A dying cgroup with all
>  	 * corresponding structures like per-cpu stats and kmem caches
>  	 * can be really big, so it may lead to a significant waste of memory.
>  	 */
> -	delta = max_t(unsigned long long, delta, min(freeable, batch_size));
> +	if (!delta) {
> +		shrinker->small_scan += freeable;
> +
> +		delta = shrinker->small_scan >> priority;
> +		shrinker->small_scan -= delta << priority;
> +
> +		delta *= 4;
> +		do_div(delta, shrinker->seeks);

What prevents shrinker->small_scan from over- or underflowing over time?

> +	}
>  
>  	total_scan += delta;
>  	if (total_scan < 0) {

I'll add this:

whitespace fixes, per Roman

--- a/mm/vmscan.c~mmslabvmscan-accumulate-gradual-pressure-on-small-slabs-fix
+++ a/mm/vmscan.c
@@ -488,7 +488,7 @@ static unsigned long do_shrink_slab(stru
 		 * them aggressively under memory pressure to keep
 		 * them from causing refetches in the IO caches.
 		 */
-		delta = (freeable + 1)/ 2;
+		delta = (freeable + 1) / 2;
 	}
 
 	/*
@@ -508,7 +508,6 @@ static unsigned long do_shrink_slab(stru
 
 		delta *= 4;
 		do_div(delta, shrinker->seeks);
-
 	}
 
 	total_scan += delta;
Rik van Riel Jan. 28, 2019, 8:03 p.m. UTC | #3
On Mon, 2019-01-28 at 11:54 -0800, Andrew Morton wrote:
> On Mon, 28 Jan 2019 14:35:35 -0500 Rik van Riel <riel@surriel.com>
> wrote:
> 
> >  	/*
> >  	 * Make sure we apply some minimal pressure on default priority
> > -	 * even on small cgroups. Stale objects are not only consuming
> > memory
> > +	 * even on small cgroups, by accumulating pressure across
> > multiple
> > +	 * slab shrinker runs. Stale objects are not only consuming
> > memory
> >  	 * by themselves, but can also hold a reference to a dying
> > cgroup,
> >  	 * preventing it from being reclaimed. A dying cgroup with all
> >  	 * corresponding structures like per-cpu stats and kmem caches
> >  	 * can be really big, so it may lead to a significant waste of
> > memory.
> >  	 */
> > -	delta = max_t(unsigned long long, delta, min(freeable,
> > batch_size));
> > +	if (!delta) {
> > +		shrinker->small_scan += freeable;
> > +
> > +		delta = shrinker->small_scan >> priority;
> > +		shrinker->small_scan -= delta << priority;
> > +
> > +		delta *= 4;
> > +		do_div(delta, shrinker->seeks);
> 
> What prevents shrinker->small_scan from over- or underflowing over
> time?

We only go into this code path if
delta >> DEF_PRIORITY is zero.

That is, freeable is smaller than 4096.

> I'll add this:

> whitespace fixes, per Roman

Awesome, thank you!
Andrew Morton Jan. 28, 2019, 8:10 p.m. UTC | #4
On Mon, 28 Jan 2019 15:03:28 -0500 Rik van Riel <riel@surriel.com> wrote:

> On Mon, 2019-01-28 at 11:54 -0800, Andrew Morton wrote:
> > On Mon, 28 Jan 2019 14:35:35 -0500 Rik van Riel <riel@surriel.com>
> > wrote:
> > 
> > >  	/*
> > >  	 * Make sure we apply some minimal pressure on default priority
> > > -	 * even on small cgroups. Stale objects are not only consuming
> > > memory
> > > +	 * even on small cgroups, by accumulating pressure across
> > > multiple
> > > +	 * slab shrinker runs. Stale objects are not only consuming
> > > memory
> > >  	 * by themselves, but can also hold a reference to a dying
> > > cgroup,
> > >  	 * preventing it from being reclaimed. A dying cgroup with all
> > >  	 * corresponding structures like per-cpu stats and kmem caches
> > >  	 * can be really big, so it may lead to a significant waste of
> > > memory.
> > >  	 */
> > > -	delta = max_t(unsigned long long, delta, min(freeable,
> > > batch_size));
> > > +	if (!delta) {
> > > +		shrinker->small_scan += freeable;
> > > +
> > > +		delta = shrinker->small_scan >> priority;
> > > +		shrinker->small_scan -= delta << priority;
> > > +
> > > +		delta *= 4;
> > > +		do_div(delta, shrinker->seeks);
> > 
> > What prevents shrinker->small_scan from over- or underflowing over
> > time?
> 
> We only go into this code path if
> delta >> DEF_PRIORITY is zero.
> 
> That is, freeable is smaller than 4096.
> 

I'm still not understanding.  If `freeable' always has a value of (say)
1, we'll eventually overflow shrinker->small_scan?  Or at least, it's
unobvious why this cannot happen.
Rik van Riel Jan. 28, 2019, 8:34 p.m. UTC | #5
On Mon, 2019-01-28 at 12:10 -0800, Andrew Morton wrote:
> On Mon, 28 Jan 2019 15:03:28 -0500 Rik van Riel <riel@surriel.com>
> wrote:
> 
> > On Mon, 2019-01-28 at 11:54 -0800, Andrew Morton wrote:
> > > On Mon, 28 Jan 2019 14:35:35 -0500 Rik van Riel <riel@surriel.com
> > > >
> > > wrote:
> > > 
> > > > memory.
> > > >  	 */
> > > > -	delta = max_t(unsigned long long, delta, min(freeable,
> > > > batch_size));
> > > > +	if (!delta) {
> > > > +		shrinker->small_scan += freeable;
> > > > +
> > > > +		delta = shrinker->small_scan >> priority;
> > > > +		shrinker->small_scan -= delta << priority;

When delta is a non-zero number, we subtract (delta << priority)
from shrinker->small_scan.

That happens every time shrinker->small_scan reaches (1 << priority),
which is 4096 for DEF_PRIORITY.

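To make that concrete for the freeable == 1 case, here is a rough
standalone model of the accumulator (illustration only, not the patched
code), assuming the shrinker keeps being called at DEF_PRIORITY:

#include <assert.h>

int main(void)
{
	int small_scan = 0;
	int priority = 12;	/* DEF_PRIORITY */
	long run;

	for (run = 0; run < 1000000; run++) {
		int delta;

		small_scan += 1;	/* freeable == 1 */
		delta = small_scan >> priority;
		small_scan -= delta << priority;

		/*
		 * Once small_scan reaches 1 << priority, delta becomes
		 * non-zero and the whole multiple is subtracted again,
		 * so small_scan stays below 1 << priority here.
		 */
		assert(small_scan < (1 << priority));
	}
	return 0;
}

If the priority varies between calls the bound is less obvious, so this
only models the steady DEF_PRIORITY case.
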
> > > > +
> > > > +		delta *= 4;
> > > > +		do_div(delta, shrinker->seeks);
> > > 
> > > What prevents shrinker->small_scan from over- or underflowing
> > > over
> > > time?
> > 
> > We only go into this code path if
> > delta >> DEF_PRIORITY is zero.
> > 
> > That is, freeable is smaller than 4096.
> > 
> 
> I'm still not understanding.  If `freeable' always has a value of
> (say)
> 1, we'll eventually overflow shrinker->small_scan?  Or at least, it's
> unobvious why this cannot happen.
Johannes Weiner Jan. 28, 2019, 9:31 p.m. UTC | #6
On Mon, Jan 28, 2019 at 02:35:35PM -0500, Rik van Riel wrote:
> There are a few issues with the way the number of slab objects to
> scan is calculated in do_shrink_slab.  First, for zero-seek slabs,
> we could leave the last object around forever. That could result
> in pinning a dying cgroup into memory, instead of reclaiming it.
> The fix for that is trivial.
> 
> Secondly, small slabs receive much more pressure, relative to their
> size, than larger slabs, due to "rounding up" the minimum number of
> scanned objects to batch_size.
> 
> We can keep the pressure on all slabs equal relative to their size
> by accumulating the scan pressure on small slabs over time, resulting
> in sometimes scanning an object, instead of always scanning several.
> 
> This results in lower system CPU use, and a lower major fault rate,
> as actively used entries from smaller caches get reclaimed less
> aggressively, and need to be reloaded/recreated less often.
> 
> Fixes: 4b85afbdacd2 ("mm: zero-seek shrinkers")
> Fixes: 172b06c32b94 ("mm: slowly shrink slabs with a relatively small number of objects")
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Chris Mason <clm@fb.com>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: kernel-team@fb.com
> Tested-by: Chris Mason <clm@fb.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Patch

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 9443cafd1969..7a9a1a0f935c 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -65,6 +65,7 @@  struct shrinker {
 
 	long batch;	/* reclaim batch size, 0 = default */
 	int seeks;	/* seeks to recreate an obj */
+	int small_scan;	/* accumulate pressure on slabs with few objects */
 	unsigned flags;
 
 	/* These are for internal use */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a714c4f800e9..0e375bd7a8b6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -488,18 +488,28 @@  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		 * them aggressively under memory pressure to keep
 		 * them from causing refetches in the IO caches.
 		 */
-		delta = freeable / 2;
+		delta = (freeable + 1)/ 2;
 	}
 
 	/*
 	 * Make sure we apply some minimal pressure on default priority
-	 * even on small cgroups. Stale objects are not only consuming memory
+	 * even on small cgroups, by accumulating pressure across multiple
+	 * slab shrinker runs. Stale objects are not only consuming memory
 	 * by themselves, but can also hold a reference to a dying cgroup,
 	 * preventing it from being reclaimed. A dying cgroup with all
 	 * corresponding structures like per-cpu stats and kmem caches
 	 * can be really big, so it may lead to a significant waste of memory.
 	 */
-	delta = max_t(unsigned long long, delta, min(freeable, batch_size));
+	if (!delta) {
+		shrinker->small_scan += freeable;
+
+		delta = shrinker->small_scan >> priority;
+		shrinker->small_scan -= delta << priority;
+
+		delta *= 4;
+		do_div(delta, shrinker->seeks);
+
+	}
 
 	total_scan += delta;
 	if (total_scan < 0) {