[12/28] shrinker: defer work only to kswapd

Message ID	20191031234618.15403-13-david@fromorbit.com (mailing list archive)
State	Deferred, archived
Headers
Return-Path: <SRS0=K+Ru=YY=vger.kernel.org=linux-xfs-owner@kernel.org>
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH 12/28] shrinker: defer work only to kswapd
Date: Fri, 1 Nov 2019 10:46:02 +1100
Message-Id: <20191031234618.15403-13-david@fromorbit.com>
In-Reply-To: <20191031234618.15403-1-david@fromorbit.com>
References: <20191031234618.15403-1-david@fromorbit.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: linux-xfs-owner@vger.kernel.org
Precedence: bulk

Series	mm, xfs: non-blocking inode reclaim
[00/28] mm, xfs: non-blocking inode reclaim
[01/28] xfs: Lower CIL flush limit for large logs
[02/28] xfs: Throttle commits on delayed background CIL push
[03/28] xfs: don't allow log IO to be throttled
[04/28] xfs: Improve metadata buffer reclaim accountability
[05/28] xfs: correctly acount for reclaimable slabs
[06/28] xfs: factor common AIL item deletion code
[07/28] xfs: tail updates only need to occur when LSN changes
[08/28] xfs: factor inode lookup from xfs_ifree_cluster
[09/28] mm: directed shrinker work deferral
[10/28] shrinkers: use defer_work for GFP_NOFS sensitive shrinkers
[11/28] mm: factor shrinker work calculations
[12/28] shrinker: defer work only to kswapd
[13/28] shrinker: clean up variable types and tracepoints
[14/28] mm: reclaim_state records pages reclaimed, not slabs
[15/28] mm: back off direct reclaim on excessive shrinker deferral
[16/28] mm: kswapd backoff for shrinkers
[17/28] xfs: synchronous AIL pushing
[18/28] xfs: don't block kswapd in inode reclaim
[19/28] xfs: reduce kswapd blocking on inode locking.
[20/28] xfs: kill background reclaim work
[21/28] xfs: use AIL pushing for inode reclaim IO
[22/28] xfs: remove mode from xfs_reclaim_inodes()
[23/28] xfs: track reclaimable inodes using a LRU list
[24/28] xfs: reclaim inodes from the LRU
[25/28] xfs: remove unusued old inode reclaim code
[26/28] xfs: use xfs_ail_push_all in xfs_reclaim_inodes
[27/28] rwsem: introduce down/up_write_non_owner
[28/28] xfs: rework unreferenced inode lookups

Commit Message

Dave Chinner Oct. 31, 2019, 11:46 p.m. UTC

From: Dave Chinner <dchinner@redhat.com>

Right now deferred work is picked up by whichever GFP_KERNEL context
reclaimer wins the race to empty the node's deferred work
counter. However, if there are lots of direct reclaimers, that
work might be continually picked up by contexts that can't do any
work, and so the opportunities to do the work are missed by contexts
that could do it.

A further problem with the current code is that the deferred work
can be picked up by a random direct reclaimer, resulting in that
specific process having to do all the deferred reclaim work and
hence suffering extremely long latencies if the reclaim work blocks
regularly. This is not good for direct reclaim fairness or for
minimising long tail latency events.

To avoid these problems, simply limit deferred work to kswapd
contexts. We know kswapd is a context that can always do reclaim
work, and hence deferring work to kswapd allows the deferred work to
be done in the background and not adversely affect any specific
process context doing direct reclaim.

The advantage of this is that the amount of work to be done in direct
reclaim is now bounded and predictable - it is entirely based on
the cache's freeable objects and the reclaim priority. Hence all
direct reclaimers running at the same time should be doing
relatively equal amounts of work, thereby reducing the incidence of
long tail latencies due to uneven reclaim workloads.

Note that we use signed integers for everything except the freed
count as the returns from the shrinker callouts cannot be guaranteed
untainted. Indeed, the shrinkers can return scan counts larger than
were fed in, so we need scan counts to underflow in a detectable
manner to terminate loops. This is necessary to prevent a misbehaving
shrinker from triggering endless scanning loops.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/shrinker.h |   2 +-
 mm/vmscan.c              | 100 ++++++++++++++++++++-------------------
 2 files changed, 53 insertions(+), 49 deletions(-)
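
The signed-counter argument in the last paragraph of the commit message can be illustrated with a small user-space model. This is only a sketch: the function names, the `max_iter` safety cap and the fixed `reported` value are hypothetical and not part of the patch, and the loop checks only the batch-size condition. It compares how a scan loop behaves when a callout reports more objects scanned than the loop had left, once with a signed counter and once with an unsigned one.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model, not kernel code: count loop passes when every
 * callout reports `reported` objects scanned, however few were left.
 * `max_iter` is a safety cap so the model always terminates.
 */
static int passes_signed(int64_t scan_count, int64_t batch_size,
			 int64_t reported, int max_iter)
{
	int passes = 0;

	assert(batch_size > 0);
	while (scan_count >= batch_size && passes < max_iter) {
		scan_count -= reported;	/* may go negative, which ends the loop */
		passes++;
	}
	return passes;
}

static int passes_unsigned(uint64_t scan_count, uint64_t batch_size,
			   uint64_t reported, int max_iter)
{
	int passes = 0;

	assert(batch_size > 0);
	while (scan_count >= batch_size && passes < max_iter) {
		scan_count -= reported;	/* wraps to a huge value on underflow */
		passes++;
	}
	return passes;
}
```

With `scan_count = 200`, `batch_size = 128` and `reported = 300`, the signed variant stops after one pass, while the unsigned variant wraps around and only stops at the `max_iter` cap.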

Comments

Brian Foster Nov. 4, 2019, 3:29 p.m. UTC | #1

On Fri, Nov 01, 2019 at 10:46:02AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Right now deferred work is picked up by whichever GFP_KERNEL context
> reclaimer wins the race to empty the node's deferred work
> counter. However, if there are lots of direct reclaimers, that
> work might be continually picked up by contexts that can't do any
> work, and so the opportunities to do the work are missed by contexts
> that could do it.
> 
> A further problem with the current code is that the deferred work
> can be picked up by a random direct reclaimer, resulting in that
> specific process having to do all the deferred reclaim work and
> hence suffering extremely long latencies if the reclaim work blocks
> regularly. This is not good for direct reclaim fairness or for
> minimising long tail latency events.
> 
> To avoid these problems, simply limit deferred work to kswapd
> contexts. We know kswapd is a context that can always do reclaim
> work, and hence deferring work to kswapd allows the deferred work to
> be done in the background and not adversely affect any specific
> process context doing direct reclaim.
> 
> The advantage of this is that the amount of work to be done in direct
> reclaim is now bounded and predictable - it is entirely based on
> the cache's freeable objects and the reclaim priority. Hence all
> direct reclaimers running at the same time should be doing
> relatively equal amounts of work, thereby reducing the incidence of
> long tail latencies due to uneven reclaim workloads.
> 
> Note that we use signed integers for everything except the freed
> count as the returns from the shrinker callouts cannot be guaranteed
> untainted. Indeed, the shrinkers can return scan counts larger than
> were fed in, so we need scan counts to underflow in a detectable
> manner to terminate loops. This is necessary to prevent a misbehaving
> shrinker from triggering endless scanning loops.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  include/linux/shrinker.h |   2 +-
>  mm/vmscan.c              | 100 ++++++++++++++++++++-------------------
>  2 files changed, 53 insertions(+), 49 deletions(-)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 3405c39ab92c..30c10f42109f 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -81,7 +81,7 @@ struct shrinker {
>  	int id;
>  #endif
>  	/* objs pending delete, per node */
> -	atomic_long_t *nr_deferred;
> +	atomic64_t *nr_deferred;
>  };
>  #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2d39ec37c04d..c0e2bf656e3f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -517,16 +517,16 @@ static int64_t shrink_scan_count(struct shrink_control *shrinkctl,
>  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  				    struct shrinker *shrinker, int priority)
>  {
> -	unsigned long freed = 0;
> -	long total_scan;
> +	uint64_t freed = 0;
>  	int64_t freeable_objects = 0;
>  	int64_t scan_count;
> -	long nr;
> -	long new_nr;
> +	int64_t scanned_objects = 0;
> +	int64_t next_deferred = 0;
> +	int64_t deferred_count = 0;
> +	int64_t new_nr;
>  	int nid = shrinkctl->nid;
>  	long batch_size = shrinker->batch ? shrinker->batch
>  					  : SHRINK_BATCH;
> -	long scanned = 0, next_deferred;
>  
>  	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
>  		nid = 0;
> @@ -537,47 +537,51 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		return scan_count;
>  
>  	/*
> -	 * copy the current shrinker scan count into a local variable
> -	 * and zero it so that other concurrent shrinker invocations
> -	 * don't also do this scanning work.
> -	 */
> -	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> -
> -	total_scan = nr + scan_count;
> -	if (total_scan < 0) {
> -		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
> -		       shrinker->scan_objects, total_scan);
> -		total_scan = scan_count;
> -		next_deferred = nr;
> -	} else
> -		next_deferred = total_scan;
> -
> -	/*
> -	 * We need to avoid excessive windup on filesystem shrinkers
> -	 * due to large numbers of GFP_NOFS allocations causing the
> -	 * shrinkers to return -1 all the time. This results in a large
> -	 * nr being built up so when a shrink that can do some work
> -	 * comes along it empties the entire cache due to nr >>>
> -	 * freeable. This is bad for sustaining a working set in
> -	 * memory.
> +	 * If kswapd, we take all the deferred work and do it here. We don't let
> +	 * direct reclaim do this, because then it means some poor sod is going
> +	 * to have to do somebody else's GFP_NOFS reclaim, and it hides the real
> +	 * amount of reclaim work from concurrent kswapd operations. Hence we do
> +	 * the work in the wrong place, at the wrong time, and it's largely
> +	 * unpredictable.
>  	 *
> -	 * Hence only allow the shrinker to scan the entire cache when
> -	 * a large delta change is calculated directly.
> +	 * By doing the deferred work only in kswapd, we can schedule the work
> +	 * according the the reclaim priority - low priority reclaim will do
> +	 * less deferred work, hence we'll do more of the deferred work the more
> +	 * desperate we become for free memory. This avoids the need for needing
> +	 * to specifically avoid deferred work windup as low amount os memory
> +	 * pressure won't excessive trim caches anymore.

That last sentence is hard to read. ;)

>  	 */
> -	if (scan_count < freeable_objects / 4)
> -		total_scan = min_t(long, total_scan, freeable_objects / 2);
> +	if (current_is_kswapd()) {
> +		int64_t	deferred_scan;
> +
> +		deferred_count = atomic64_xchg(&shrinker->nr_deferred[nid], 0);
> +
> +		/* we want to scan 5-10% of the deferred work here at minimum */
> +		deferred_scan = deferred_count;
> +		if (priority)
> +			do_div(deferred_scan, priority);
> +		scan_count += deferred_scan;
> +
> +		/*
> +		 * If there is more deferred work than the number of freeable
> +		 * items in the cache, limit the amount of work we will carry
> +		 * over to the next kswapd run on this cache. This prevents
> +		 * deferred work windup.
> +		 */
> +		deferred_count = min(deferred_count, freeable_objects * 2);
> +

Extra whitespace above.

> +	}
>  
>  	/*
>  	 * Avoid risking looping forever due to too large nr value:
>  	 * never try to free more than twice the estimate number of
>  	 * freeable entries.
>  	 */

The comment refers to a variable that no longer exists.

I also wonder if it's a little cleaner to move the deferred_count =
min(...); statement above down here and condense the two comments.

> -	if (total_scan > freeable_objects * 2)
> -		total_scan = freeable_objects * 2;
> +	scan_count = min(scan_count, freeable_objects * 2);
>  
> -	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> +	trace_mm_shrink_slab_start(shrinker, shrinkctl, deferred_count,
>  				   freeable_objects, scan_count,
> -				   total_scan, priority);
> +				   scan_count, priority);
>  
>  	/*
>  	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> @@ -601,10 +605,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 * scanning at high prio and therefore should try to reclaim as much as
>  	 * possible.
>  	 */
> -	while (total_scan >= batch_size ||
> -	       total_scan >= freeable_objects) {
> +	while (scan_count >= batch_size ||
> +	       scan_count >= freeable_objects) {
>  		unsigned long ret;
> -		unsigned long nr_to_scan = min(batch_size, total_scan);
> +		unsigned long nr_to_scan = min_t(long, batch_size, scan_count);
>  
>  		shrinkctl->nr_to_scan = nr_to_scan;
>  		shrinkctl->nr_scanned = nr_to_scan;
> @@ -614,29 +618,29 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		freed += ret;
>  
>  		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
> -		total_scan -= shrinkctl->nr_scanned;
> -		scanned += shrinkctl->nr_scanned;
> +		scan_count -= shrinkctl->nr_scanned;
> +		scanned_objects += shrinkctl->nr_scanned;
>  
>  		cond_resched();
>  	}
> -
>  done:
> -	if (next_deferred >= scanned)
> -		next_deferred -= scanned;
> +	if (deferred_count)
> +		next_deferred = deferred_count - scanned_objects;
>  	else
> -		next_deferred = 0;
> +		next_deferred = scan_count;

Hmm.. so if there was no deferred count on this cycle, we set
next_deferred to whatever is left from scan_count and add that back into
the shrinker struct below. If there was a pending deferred count on this
cycle, we subtract what we scanned from that and add that value back.
But what happens to the remaining scan_count in the latter case? Is it
lost, or am I missing something?

For example, suppose we start this cycle with a large scan_count and
->scan_objects() returned SHRINK_STOP before doing much work. In that
scenario, it looks like whether ->nr_deferred is 0 or not is the only
thing that determines whether we defer the entire remaining scan_count
or just what is left from the previous ->nr_deferred. The existing code
appears to consistently factor in what is left from the current scan
with the previous deferred count. Hm?

>  	/*
>  	 * move the unused scan count back into the shrinker in a
>  	 * manner that handles concurrent updates. If we exhausted the
>  	 * scan, there is no need to do an update.
>  	 */
>  	if (next_deferred > 0)
> -		new_nr = atomic_long_add_return(next_deferred,
> +		new_nr = atomic64_add_return(next_deferred,
>  						&shrinker->nr_deferred[nid]);
>  	else
> -		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
> +		new_nr = atomic64_read(&shrinker->nr_deferred[nid]);

It looks like we could kill new_nr and just reuse next_deferred here
too.

Brian

>  
> -	trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
> +	trace_mm_shrink_slab_end(shrinker, nid, freed, deferred_count, new_nr,
> +					scan_count);
>  	return freed;
>  }
>  
> -- 
> 2.24.0.rc0
>
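
The two branches questioned in the review above can be restated as a small user-space model. This is a sketch only: `next_deferred_model` and `carried_back` are hypothetical names that mirror the `done:` logic of the patch, with the atomic counter update replaced by a plain return value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the patched `done:` logic. `scan_left` is what
 * remains of scan_count after the scan loop has finished.
 */
static int64_t next_deferred_model(int64_t deferred_count,
				   int64_t scanned_objects,
				   int64_t scan_left)
{
	assert(scanned_objects >= 0);
	if (deferred_count)
		return deferred_count - scanned_objects;
	return scan_left;
}

/* Only a positive remainder is added back to the deferred counter. */
static int64_t carried_back(int64_t next_deferred)
{
	return next_deferred > 0 ? next_deferred : 0;
}
```

For a pass that stops early with `scan_left = 5000` and nothing scanned, the model carries back 5000 when `deferred_count` is 0 but only 100 when `deferred_count` is 100, which is the asymmetry the question is about.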

Dave Chinner Nov. 14, 2019, 9:11 p.m. UTC | #2

On Mon, Nov 04, 2019 at 10:29:54AM -0500, Brian Foster wrote:
> On Fri, Nov 01, 2019 at 10:46:02AM +1100, Dave Chinner wrote:
> > @@ -601,10 +605,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  	 * scanning at high prio and therefore should try to reclaim as much as
> >  	 * possible.
> >  	 */
> > -	while (total_scan >= batch_size ||
> > -	       total_scan >= freeable_objects) {
> > +	while (scan_count >= batch_size ||
> > +	       scan_count >= freeable_objects) {
> >  		unsigned long ret;
> > -		unsigned long nr_to_scan = min(batch_size, total_scan);
> > +		unsigned long nr_to_scan = min_t(long, batch_size, scan_count);
> >  
> >  		shrinkctl->nr_to_scan = nr_to_scan;
> >  		shrinkctl->nr_scanned = nr_to_scan;
> > @@ -614,29 +618,29 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  		freed += ret;
> >  
> >  		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
> > -		total_scan -= shrinkctl->nr_scanned;
> > -		scanned += shrinkctl->nr_scanned;
> > +		scan_count -= shrinkctl->nr_scanned;
> > +		scanned_objects += shrinkctl->nr_scanned;
> >  
> >  		cond_resched();
> >  	}
> > -
> >  done:
> > -	if (next_deferred >= scanned)
> > -		next_deferred -= scanned;
> > +	if (deferred_count)
> > +		next_deferred = deferred_count - scanned_objects;
> >  	else
> > -		next_deferred = 0;
> > +		next_deferred = scan_count;
> 
> Hmm.. so if there was no deferred count on this cycle, we set
> next_deferred to whatever is left from scan_count and add that back into
> the shrinker struct below. If there was a pending deferred count on this
> cycle, we subtract what we scanned from that and add that value back.
> But what happens to the remaining scan_count in the latter case? Is it
> lost, or am I missing something?

If deferred_count is not zero, then it is kswapd that is running. It
does the deferred work, and if it doesn't make progress then adding
its scan count to the deferred work doesn't matter. That's because
it will come back with an increased priority in a short while and
try to scan more of the deferred count plus its larger scan count.

IOWs, if we defer kswapd's unused scan count, we effectively increase
the pressure as the priority goes up, potentially making the
deferred count increase out of control. i.e. kswapd can make
progress and free items, but the result is that it increased the
deferred scan count rather than reducing it. This leads to excessive
reclaim of the slab caches and kswapd can trash the caches long
after the memory pressure has gone away...

> For example, suppose we start this cycle with a large scan_count and
> ->scan_objects() returned SHRINK_STOP before doing much work. In that
> scenario, it looks like whether ->nr_deferred is 0 or not is the only
> thing that determines whether we defer the entire remaining scan_count
> or just what is left from the previous ->nr_deferred. The existing code
> appears to consistently factor in what is left from the current scan
> with the previous deferred count. Hm?

If kswapd doesn't have any deferred work, then it's largely no
different in behaviour to direct reclaim. If it has no deferred
work, then the shrinker is not getting stopped early in direct
reclaim, so it's unlikely that kswapd is going to get stopped early,
either....

Cheers,

Dave.
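
The priority scaling described in this reply can be sketched as a user-space model of the `current_is_kswapd()` branch in the patch. The struct and function names are hypothetical, and plain division stands in for `do_div()`; the arithmetic otherwise follows that hunk. For reference, the kernel's default reclaim priority (`DEF_PRIORITY`) is 12, so a first pass takes roughly 8% of the deferred count.

```c
#include <assert.h>
#include <stdint.h>

struct kswapd_scan {
	int64_t scan_count;	/* objects to scan on this pass */
	int64_t deferred_count;	/* deferred work, capped to avoid windup */
};

static int64_t min64(int64_t a, int64_t b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical model of the kswapd branch: take a priority-scaled share
 * of the deferred work, then clamp both values to twice the freeable count.
 */
static struct kswapd_scan kswapd_scan_model(int64_t scan_count,
					    int64_t deferred_count,
					    int64_t freeable_objects,
					    int priority)
{
	struct kswapd_scan out;
	int64_t deferred_scan = deferred_count;

	assert(priority >= 0);
	if (priority)
		deferred_scan /= priority;	/* lower priority value, bigger share */

	out.scan_count = min64(scan_count + deferred_scan, freeable_objects * 2);
	out.deferred_count = min64(deferred_count, freeable_objects * 2);
	return out;
}
```

With 1200 deferred objects and a base scan count of 100, the model scans 200 objects at priority 12 and 1300 at priority 1, so the deferred share grows as reclaim becomes more urgent.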

Brian Foster Nov. 15, 2019, 5:23 p.m. UTC | #3

On Fri, Nov 15, 2019 at 08:11:50AM +1100, Dave Chinner wrote:
> On Mon, Nov 04, 2019 at 10:29:54AM -0500, Brian Foster wrote:
> > On Fri, Nov 01, 2019 at 10:46:02AM +1100, Dave Chinner wrote:
> > > @@ -601,10 +605,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> > >  	 * scanning at high prio and therefore should try to reclaim as much as
> > >  	 * possible.
> > >  	 */
> > > -	while (total_scan >= batch_size ||
> > > -	       total_scan >= freeable_objects) {
> > > +	while (scan_count >= batch_size ||
> > > +	       scan_count >= freeable_objects) {
> > >  		unsigned long ret;
> > > -		unsigned long nr_to_scan = min(batch_size, total_scan);
> > > +		unsigned long nr_to_scan = min_t(long, batch_size, scan_count);
> > >  
> > >  		shrinkctl->nr_to_scan = nr_to_scan;
> > >  		shrinkctl->nr_scanned = nr_to_scan;
> > > @@ -614,29 +618,29 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> > >  		freed += ret;
> > >  
> > >  		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
> > > -		total_scan -= shrinkctl->nr_scanned;
> > > -		scanned += shrinkctl->nr_scanned;
> > > +		scan_count -= shrinkctl->nr_scanned;
> > > +		scanned_objects += shrinkctl->nr_scanned;
> > >  
> > >  		cond_resched();
> > >  	}
> > > -
> > >  done:
> > > -	if (next_deferred >= scanned)
> > > -		next_deferred -= scanned;
> > > +	if (deferred_count)
> > > +		next_deferred = deferred_count - scanned_objects;
> > >  	else
> > > -		next_deferred = 0;
> > > +		next_deferred = scan_count;
> > 
> > Hmm.. so if there was no deferred count on this cycle, we set
> > next_deferred to whatever is left from scan_count and add that back into
> > the shrinker struct below. If there was a pending deferred count on this
> > cycle, we subtract what we scanned from that and add that value back.
> > But what happens to the remaining scan_count in the latter case? Is it
> > lost, or am I missing something?
> 
> If deferred_count is not zero, then it is kswapd that is running. It
> does the deferred work, and if it doesn't make progress then adding
> its scan count to the deferred work doesn't matter. That's because
> it will come back with an increased priority in a short while and
> try to scan more of the deferred count plus its larger scan count.
> 

Ok, so perhaps there is no functional reason to defer remaining scan
count from a context (i.e. kswapd) that attempts to process deferred
work...

> IOWs, if we defer kswapd's unused scan count, we effectively increase
> the pressure as the priority goes up, potentially making the
> deferred count increase out of control. i.e. kswapd can make
> progress and free items, but the result is that it increased the
> deferred scan count rather than reducing it. This leads to excessive
> reclaim of the slab caches and kswapd can trash the caches long
> after the memory pressure has gone away...
> 

... yet if kswapd runs without pre-existing deferred work, that's
precisely what it does. next_deferred is set to remaining scan_count and
that is added back to the shrinker struct. So should kswapd generally
defer work or not? If the answer is sometimes, then please add a comment
to the next_deferred assignment to explain when/why.

> > For example, suppose we start this cycle with a large scan_count and
> > ->scan_objects() returned SHRINK_STOP before doing much work. In that
> > scenario, it looks like whether ->nr_deferred is 0 or not is the only
> > thing that determines whether we defer the entire remaining scan_count
> > or just what is left from the previous ->nr_deferred. The existing code
> > appears to consistently factor in what is left from the current scan
> > with the previous deferred count. Hm?
> 
> If kswapd doesn't have any deferred work, then it's largely no
> different in behaviour to direct reclaim. If it has no deferred
> work, then the shrinker is not getting stopped early in direct
> reclaim, so it's unlikely that kswapd is going to get stopped early,
> either....
> 

Then perhaps the logic could be simplified to explicitly not defer from
kswapd..?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
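
One possible reading of the suggestion above is to key the decision on the calling context rather than on whether `deferred_count` is zero. The sketch below is an assumption about what such logic might look like, not code from the series: `next_deferred_explicit` and its `is_kswapd` parameter are hypothetical, with `is_kswapd` standing in for `current_is_kswapd()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical variant: kswapd only hands back what is left of the
 * deferred work it picked up and never defers its own unused scan
 * count; direct reclaim defers its unused scan count.
 */
static int64_t next_deferred_explicit(bool is_kswapd,
				      int64_t deferred_count,
				      int64_t scanned_objects,
				      int64_t scan_left)
{
	int64_t left;

	assert(scanned_objects >= 0);
	left = is_kswapd ? deferred_count - scanned_objects : scan_left;
	return left > 0 ? left : 0;
}
```

Under this variant a kswapd pass with no prior deferred work and 5000 unscanned objects defers nothing, whereas the patch as posted would defer all 5000.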

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 3405c39ab92c..30c10f42109f 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -81,7 +81,7 @@  struct shrinker {
 	int id;
 #endif
 	/* objs pending delete, per node */
-	atomic_long_t *nr_deferred;
+	atomic64_t *nr_deferred;
 };
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2d39ec37c04d..c0e2bf656e3f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -517,16 +517,16 @@  static int64_t shrink_scan_count(struct shrink_control *shrinkctl,
 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 				    struct shrinker *shrinker, int priority)
 {
-	unsigned long freed = 0;
-	long total_scan;
+	uint64_t freed = 0;
 	int64_t freeable_objects = 0;
 	int64_t scan_count;
-	long nr;
-	long new_nr;
+	int64_t scanned_objects = 0;
+	int64_t next_deferred = 0;
+	int64_t deferred_count = 0;
+	int64_t new_nr;
 	int nid = shrinkctl->nid;
 	long batch_size = shrinker->batch ? shrinker->batch
 					  : SHRINK_BATCH;
-	long scanned = 0, next_deferred;
 
 	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
 		nid = 0;
@@ -537,47 +537,51 @@  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		return scan_count;
 
 	/*
-	 * copy the current shrinker scan count into a local variable
-	 * and zero it so that other concurrent shrinker invocations
-	 * don't also do this scanning work.
-	 */
-	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
-
-	total_scan = nr + scan_count;
-	if (total_scan < 0) {
-		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
-		       shrinker->scan_objects, total_scan);
-		total_scan = scan_count;
-		next_deferred = nr;
-	} else
-		next_deferred = total_scan;
-
-	/*
-	 * We need to avoid excessive windup on filesystem shrinkers
-	 * due to large numbers of GFP_NOFS allocations causing the
-	 * shrinkers to return -1 all the time. This results in a large
-	 * nr being built up so when a shrink that can do some work
-	 * comes along it empties the entire cache due to nr >>>
-	 * freeable. This is bad for sustaining a working set in
-	 * memory.
+	 * If kswapd, we take all the deferred work and do it here. We don't let
+	 * direct reclaim do this, because then it means some poor sod is going
+	 * to have to do somebody else's GFP_NOFS reclaim, and it hides the real
+	 * amount of reclaim work from concurrent kswapd operations. Hence we do
+	 * the work in the wrong place, at the wrong time, and it's largely
+	 * unpredictable.
 	 *
-	 * Hence only allow the shrinker to scan the entire cache when
-	 * a large delta change is calculated directly.
+	 * By doing the deferred work only in kswapd, we can schedule the work
+	 * according the the reclaim priority - low priority reclaim will do
+	 * less deferred work, hence we'll do more of the deferred work the more
+	 * desperate we become for free memory. This avoids the need for needing
+	 * to specifically avoid deferred work windup as low amount os memory
+	 * pressure won't excessive trim caches anymore.
 	 */
-	if (scan_count < freeable_objects / 4)
-		total_scan = min_t(long, total_scan, freeable_objects / 2);
+	if (current_is_kswapd()) {
+		int64_t	deferred_scan;
+
+		deferred_count = atomic64_xchg(&shrinker->nr_deferred[nid], 0);
+
+		/* we want to scan 5-10% of the deferred work here at minimum */
+		deferred_scan = deferred_count;
+		if (priority)
+			do_div(deferred_scan, priority);
+		scan_count += deferred_scan;
+
+		/*
+		 * If there is more deferred work than the number of freeable
+		 * items in the cache, limit the amount of work we will carry
+		 * over to the next kswapd run on this cache. This prevents
+		 * deferred work windup.
+		 */
+		deferred_count = min(deferred_count, freeable_objects * 2);
+
+	}
 
 	/*
 	 * Avoid risking looping forever due to too large nr value:
 	 * never try to free more than twice the estimate number of
 	 * freeable entries.
 	 */
-	if (total_scan > freeable_objects * 2)
-		total_scan = freeable_objects * 2;
+	scan_count = min(scan_count, freeable_objects * 2);
 
-	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
+	trace_mm_shrink_slab_start(shrinker, shrinkctl, deferred_count,
 				   freeable_objects, scan_count,
-				   total_scan, priority);
+				   scan_count, priority);
 
 	/*
 	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
@@ -601,10 +605,10 @@  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 * scanning at high prio and therefore should try to reclaim as much as
 	 * possible.
 	 */
-	while (total_scan >= batch_size ||
-	       total_scan >= freeable_objects) {
+	while (scan_count >= batch_size ||
+	       scan_count >= freeable_objects) {
 		unsigned long ret;
-		unsigned long nr_to_scan = min(batch_size, total_scan);
+		unsigned long nr_to_scan = min_t(long, batch_size, scan_count);
 
 		shrinkctl->nr_to_scan = nr_to_scan;
 		shrinkctl->nr_scanned = nr_to_scan;
@@ -614,29 +618,29 @@  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		freed += ret;
 
 		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
-		total_scan -= shrinkctl->nr_scanned;
-		scanned += shrinkctl->nr_scanned;
+		scan_count -= shrinkctl->nr_scanned;
+		scanned_objects += shrinkctl->nr_scanned;
 
 		cond_resched();
 	}
-
 done:
-	if (next_deferred >= scanned)
-		next_deferred -= scanned;
+	if (deferred_count)
+		next_deferred = deferred_count - scanned_objects;
 	else
-		next_deferred = 0;
+		next_deferred = scan_count;
 	/*
 	 * move the unused scan count back into the shrinker in a
 	 * manner that handles concurrent updates. If we exhausted the
 	 * scan, there is no need to do an update.
 	 */
 	if (next_deferred > 0)
-		new_nr = atomic_long_add_return(next_deferred,
+		new_nr = atomic64_add_return(next_deferred,
 						&shrinker->nr_deferred[nid]);
 	else
-		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
+		new_nr = atomic64_read(&shrinker->nr_deferred[nid]);
 
-	trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
+	trace_mm_shrink_slab_end(shrinker, nid, freed, deferred_count, new_nr,
+					scan_count);
 	return freed;
 }
