[01/24] mm: directed shrinker work deferral

Message ID	20190801021752.4986-2-david@fromorbit.com (mailing list archive)
State	New, archived
Headers	Return-Path: <linux-fsdevel-owner@kernel.org> From: Dave Chinner <david@fromorbit.com> To: linux-xfs@vger.kernel.org Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 01/24] mm: directed shrinker work deferral Date: Thu, 1 Aug 2019 12:17:29 +1000 Message-Id: <20190801021752.4986-2-david@fromorbit.com> In-Reply-To: <20190801021752.4986-1-david@fromorbit.com> References: <20190801021752.4986-1-david@fromorbit.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk
Series	mm, xfs: non-blocking inode reclaim [RFC,00/24] mm, xfs: non-blocking inode reclaim [01/24] mm: directed shrinker work deferral [02/24] shrinkers: use will_defer for GFP_NOFS sensitive shrinkers [03/24] mm: factor shrinker work calculations [04/24] shrinker: defer work only to kswapd [05/24] shrinker: clean up variable types and tracepoints [06/24] mm: reclaim_state records pages reclaimed, not slabs [07/24] mm: back off direct reclaim on excessive shrinker deferral [08/24] mm: kswapd backoff for shrinkers [09/24] xfs: don't allow log IO to be throttled [10/24] xfs: fix missed wakeup on l_flush_wait [11/24] xfs:: account for memory freed from metadata buffers [12/24] xfs: correctly acount for reclaimable slabs [13/24] xfs: synchronous AIL pushing [14/24] xfs: tail updates only need to occur when LSN changes [15/24] xfs: eagerly free shadow buffers to reduce CIL footprint [16/24] xfs: Lower CIL flush limit for large logs [17/24] xfs: don't block kswapd in inode reclaim [18/24] xfs: reduce kswapd blocking on inode locking. [19/24] xfs: kill background reclaim work [20/24] xfs: use AIL pushing for inode reclaim IO [21/24] xfs: remove mode from xfs_reclaim_inodes() [22/24] xfs: track reclaimable inodes using a LRU list [23/24] xfs: reclaim inodes from the LRU [24/24] xfs: remove unusued old inode reclaim code

Dave Chinner Aug. 1, 2019, 2:17 a.m. UTC

From: Dave Chinner <dchinner@redhat.com>

Introduce a mechanism for ->count_objects() to indicate to the
shrinker infrastructure that the reclaim context will not allow
scanning work to be done and so the work it decides is necessary
needs to be deferred.

This simplifies the code by separating out the accounting of
deferred work from the actual doing of the work, and allows better
decisions to be made by the shrinker control logic on what action it
can take.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/linux/shrinker.h | 7 +++++++
 mm/vmscan.c              | 8 ++++++++
 2 files changed, 15 insertions(+)
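
Illustrative sketch (not part of the patch): the mechanism described in the
commit message can be modelled in userspace roughly as follows. All names
ending in _model, the MODEL_* constants, and the "half the cache" work
estimate are assumptions made for the example; the real do_shrink_slab()
computes its scan target differently and keeps deferred work in
shrinker->nr_deferred.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins; none of these are the real kernel definitions. */
#define MODEL_GFP_FS	0x1u	/* hypothetical stand-in for __GFP_FS */
#define MODEL_BATCH	32UL	/* hypothetical scan batch size */

struct shrink_control_model {
	unsigned int gfp_mask;
	unsigned long nr_to_scan;
	bool will_defer;	/* set by the count callback */
};

struct cache_model {
	unsigned long nr_objects;
};

/* Count callback: report freeable objects and flag contexts that cannot scan. */
static unsigned long model_count_objects(struct cache_model *cache,
					 struct shrink_control_model *sc)
{
	assert(cache && sc);
	if (!(sc->gfp_mask & MODEL_GFP_FS))
		sc->will_defer = true;
	return cache->nr_objects;
}

/* Scan callback: free up to nr_to_scan objects. */
static unsigned long model_scan_objects(struct cache_model *cache,
					struct shrink_control_model *sc)
{
	unsigned long freed = sc->nr_to_scan < cache->nr_objects ?
			      sc->nr_to_scan : cache->nr_objects;

	cache->nr_objects -= freed;
	return freed;
}

/* Returns objects freed; *deferred carries undone work to the next caller. */
static unsigned long model_do_shrink(struct cache_model *cache,
				     struct shrink_control_model *sc,
				     unsigned long *deferred)
{
	unsigned long freeable = model_count_objects(cache, sc);
	/* Simplified work estimate: half the cache plus earlier deferrals. */
	unsigned long total_scan = freeable / 2 + *deferred;
	unsigned long remaining = total_scan, scanned = 0, freed = 0;

	if (sc->will_defer)
		goto done;	/* skip the scan loop entirely */

	while (remaining > 0) {
		sc->nr_to_scan = remaining < MODEL_BATCH ? remaining : MODEL_BATCH;
		freed += model_scan_objects(cache, sc);
		scanned += sc->nr_to_scan;
		remaining -= sc->nr_to_scan;
	}
done:
	*deferred = total_scan - scanned;
	return freed;
}
```

Calling model_do_shrink() with a zero gfp_mask frees nothing and records the
whole estimate as deferred; a later call with MODEL_GFP_FS set picks that
deferred work up in addition to its own.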

Brian Foster Aug. 2, 2019, 3:27 p.m. UTC | #1

On Thu, Aug 01, 2019 at 12:17:29PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Introduce a mechanism for ->count_objects() to indicate to the
> shrinker infrastructure that the reclaim context will not allow
> scanning work to be done and so the work it decides is necessary
> needs to be deferred.
> 
> This simplifies the code by separating out the accounting of
> deferred work from the actual doing of the work, and allows better
> decisions to be made by the shrinker control logic on what action it
> can take.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  include/linux/shrinker.h | 7 +++++++
>  mm/vmscan.c              | 8 ++++++++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 9443cafd1969..af78c475fc32 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -31,6 +31,13 @@ struct shrink_control {
>  
>  	/* current memcg being shrunk (for memcg aware shrinkers) */
>  	struct mem_cgroup *memcg;
> +
> +	/*
> +	 * set by ->count_objects if reclaim context prevents reclaim from
> +	 * occurring. This allows the shrinker to immediately defer all the
> +	 * work and not even attempt to scan the cache.
> +	 */
> +	bool will_defer;

Functionality wise this seems fairly straightforward. FWIW, I find the
'will_defer' name a little confusing because it implies to me that the
shrinker is telling the caller about something it would do if called as
opposed to explicitly telling the caller to defer. I'd just call it
'defer' I guess, but that's just my .02. ;P

>  };
>  
>  #define SHRINK_STOP (~0UL)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 44df66a98f2a..ae3035fe94bc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -541,6 +541,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
>  				   freeable, delta, total_scan, priority);
>  
> +	/*
> +	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> +	 * defer the work to a context that can scan the cache.
> +	 */
> +	if (shrinkctl->will_defer)
> +		goto done;
> +

Who's responsible for clearing the flag? Perhaps we should do so here
once it's acted upon since we don't call into the shrinker again?

Note that I see this structure is reinitialized on every iteration in
the caller, but there already is the SHRINK_EMPTY case where we call
back into do_shrink_slab(). Granted the deferred state likely hasn't
changed, but the fact that we'd call back into the count callback to set
it again implies the logic could be a bit more explicit, particularly if
this will eventually be used for more dynamic shrinker state that might
change call to call (i.e., object dirty state, etc.).

BTW, do we need to care about the ->nr_cached_objects() call from the
generic superblock shrinker (super_cache_scan())?

Brian

>  	/*
>  	 * Normally, we should not scan less than batch_size objects in one
>  	 * pass to avoid too frequent shrinker calls, but if the slab has less
> @@ -575,6 +582,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		cond_resched();
>  	}
>  
> +done:
>  	if (next_deferred >= scanned)
>  		next_deferred -= scanned;
>  	else
> -- 
> 2.22.0
>

Dave Chinner Aug. 4, 2019, 1:49 a.m. UTC | #2

On Fri, Aug 02, 2019 at 11:27:09AM -0400, Brian Foster wrote:
> On Thu, Aug 01, 2019 at 12:17:29PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Introduce a mechanism for ->count_objects() to indicate to the
> > shrinker infrastructure that the reclaim context will not allow
> > scanning work to be done and so the work it decides is necessary
> > needs to be deferred.
> > 
> > This simplifies the code by separating out the accounting of
> > deferred work from the actual doing of the work, and allows better
> > decisions to be made by the shrinker control logic on what action it
> > can take.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  include/linux/shrinker.h | 7 +++++++
> >  mm/vmscan.c              | 8 ++++++++
> >  2 files changed, 15 insertions(+)
> > 
> > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > index 9443cafd1969..af78c475fc32 100644
> > --- a/include/linux/shrinker.h
> > +++ b/include/linux/shrinker.h
> > @@ -31,6 +31,13 @@ struct shrink_control {
> >  
> >  	/* current memcg being shrunk (for memcg aware shrinkers) */
> >  	struct mem_cgroup *memcg;
> > +
> > +	/*
> > +	 * set by ->count_objects if reclaim context prevents reclaim from
> > +	 * occurring. This allows the shrinker to immediately defer all the
> > +	 * work and not even attempt to scan the cache.
> > +	 */
> > +	bool will_defer;
> 
> Functionality wise this seems fairly straightforward. FWIW, I find the
> 'will_defer' name a little confusing because it implies to me that the
> shrinker is telling the caller about something it would do if called as
> opposed to explicitly telling the caller to defer. I'd just call it
> 'defer' I guess, but that's just my .02. ;P

Ok, I'll change it to something like "defer_work" or "defer_scan"
here.

> >  };
> >  
> >  #define SHRINK_STOP (~0UL)
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 44df66a98f2a..ae3035fe94bc 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -541,6 +541,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> >  				   freeable, delta, total_scan, priority);
> >  
> > +	/*
> > +	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> > +	 * defer the work to a context that can scan the cache.
> > +	 */
> > +	if (shrinkctl->will_defer)
> > +		goto done;
> > +
> 
> Who's responsible for clearing the flag? Perhaps we should do so here
> once it's acted upon since we don't call into the shrinker again?

Each shrinker invocation has its own shrink_control context - they
are not shared between shrinkers - the higher level is responsible
for setting up the control state of each individual shrinker
invocation...

> Note that I see this structure is reinitialized on every iteration in
> the caller, but there already is the SHRINK_EMPTY case where we call
> back into do_shrink_slab().

.... because there is external state tracking in memcgs that
determine what shrinkers get run. See shrink_slab_memcg().

i.e. The SHRINK_EMPTY return value is a special hack for memcg
shrinkers so it can track whether there are freeable objects in the
cache externally to try to avoid calling into shrinkers where no
work can be done.  Think about having hundreds of shrinkers and
hundreds of memcgs...

Anyway, the tracking of the freeable bit is racy, so the
SHRINK_EMPTY hack where it clears the bit and calls back into the
shrinker is handling the case where objects were freed between the
shrinker running and shrink_slab_memcg() clearing the freeable bit
from the slab. Hence it has to call back into the shrinker again -
if it gets anything other than SHRINK_EMPTY returned, then it will
set the bit again.

In reality, SHRINK_EMPTY and deferring work are mutually exclusive.
Work only gets deferred when there's work that can be done and in
that case SHRINK_EMPTY will not be returned - a value of "0 freed
objects" will be returned when we defer work. So if the first call
returns SHRINK_EMPTY, the "defer" state has not been touched and
so doesn't require resetting to zero here.
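
Illustrative sketch (an assumption-laden userspace model, not the kernel
code): the SHRINK_EMPTY retry described above, and why a deferring shrinker
does not take that path, can be pictured like this. The struct, the
maybe_freeable flag (standing in for the per-memcg shrinker bitmap bit) and
the function names are hypothetical; only the value of SHRINK_EMPTY is taken
from include/linux/shrinker.h.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_SHRINK_EMPTY	(~0UL - 1)	/* same value as the kernel's SHRINK_EMPTY */

struct memcg_shrinker_model {
	bool maybe_freeable;		/* stand-in for the memcg shrinker bit */
	unsigned long nr_objects;
	bool context_allows_scan;	/* false models a GFP_NOFS caller */
	unsigned int calls;		/* how often the shrinker was entered */
};

static unsigned long model_do_shrink_slab(struct memcg_shrinker_model *s)
{
	unsigned long freed;

	assert(s);
	s->calls++;
	if (s->nr_objects == 0)
		return MODEL_SHRINK_EMPTY;	/* nothing counted, nothing to defer */
	if (!s->context_allows_scan)
		return 0;			/* work exists but is deferred */
	freed = s->nr_objects;
	s->nr_objects = 0;
	return freed;
}

static unsigned long model_shrink_one(struct memcg_shrinker_model *s)
{
	unsigned long ret;

	if (!s->maybe_freeable)
		return 0;

	ret = model_do_shrink_slab(s);
	if (ret == MODEL_SHRINK_EMPTY) {
		/*
		 * Clear the bit, then look again: objects may have been
		 * added between the first call and the clear.
		 */
		s->maybe_freeable = false;
		ret = model_do_shrink_slab(s);
		if (ret == MODEL_SHRINK_EMPTY)
			ret = 0;
		else
			s->maybe_freeable = true;
	}
	return ret;
}
```

In this model an empty cache is entered twice and ends with the bit cleared,
while a deferring call is entered once, returns 0, and leaves the bit set.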

> Granted the deferred state likely hasn't
> changed, but the fact that we'd call back into the count callback to set
> it again implies the logic could be a bit more explicit, particularly if
> this will eventually be used for more dynamic shrinker state that might
> change call to call (i.e., object dirty state, etc.).
> 
> BTW, do we need to care about the ->nr_cached_objects() call from the
> generic superblock shrinker (super_cache_scan())?

No, and we never had to because it is inside the superblock shrinker
and the superblock shrinker does the GFP_NOFS context checks.

Cheers,

Dave.

Brian Foster Aug. 5, 2019, 5:42 p.m. UTC | #3

On Sun, Aug 04, 2019 at 11:49:30AM +1000, Dave Chinner wrote:
> On Fri, Aug 02, 2019 at 11:27:09AM -0400, Brian Foster wrote:
> > On Thu, Aug 01, 2019 at 12:17:29PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Introduce a mechanism for ->count_objects() to indicate to the
> > > shrinker infrastructure that the reclaim context will not allow
> > > scanning work to be done and so the work it decides is necessary
> > > needs to be deferred.
> > > 
> > > This simplifies the code by separating out the accounting of
> > > deferred work from the actual doing of the work, and allows better
> > > decisions to be made by the shrinker control logic on what action it
> > > can take.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  include/linux/shrinker.h | 7 +++++++
> > >  mm/vmscan.c              | 8 ++++++++
> > >  2 files changed, 15 insertions(+)
> > > 
> > > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > > index 9443cafd1969..af78c475fc32 100644
> > > --- a/include/linux/shrinker.h
> > > +++ b/include/linux/shrinker.h
> > > @@ -31,6 +31,13 @@ struct shrink_control {
> > >  
> > >  	/* current memcg being shrunk (for memcg aware shrinkers) */
> > >  	struct mem_cgroup *memcg;
> > > +
> > > +	/*
> > > +	 * set by ->count_objects if reclaim context prevents reclaim from
> > > +	 * occurring. This allows the shrinker to immediately defer all the
> > > +	 * work and not even attempt to scan the cache.
> > > +	 */
> > > +	bool will_defer;
> > 
> > Functionality wise this seems fairly straightforward. FWIW, I find the
> > 'will_defer' name a little confusing because it implies to me that the
> > shrinker is telling the caller about something it would do if called as
> > opposed to explicitly telling the caller to defer. I'd just call it
> > 'defer' I guess, but that's just my .02. ;P
> 
> Ok, I'll change it to something like "defer_work" or "defer_scan"
> here.
> 

Either sounds better to me, thanks.

> > >  };
> > >  
> > >  #define SHRINK_STOP (~0UL)
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 44df66a98f2a..ae3035fe94bc 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -541,6 +541,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> > >  	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> > >  				   freeable, delta, total_scan, priority);
> > >  
> > > +	/*
> > > +	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> > > +	 * defer the work to a context that can scan the cache.
> > > +	 */
> > > +	if (shrinkctl->will_defer)
> > > +		goto done;
> > > +
> > 
> > Who's responsible for clearing the flag? Perhaps we should do so here
> > once it's acted upon since we don't call into the shrinker again?
> 
> Each shrinker invocation has its own shrink_control context - they
> are not shared between shrinkers - the higher level is responsible
> for setting up the control state of each individual shrinker
> invocation...
> 

Yes, but more specifically, it appears to me that each level is
responsible for setting up control state managed by that level. E.g.,
shrink_slab_memcg() initializes the unchanging state per iteration and
do_shrink_slab() (re)sets the scan state prior to ->scan_objects().

> > Note that I see this structure is reinitialized on every iteration in
> > the caller, but there already is the SHRINK_EMPTY case where we call
> > back into do_shrink_slab().
> 
> .... because there is external state tracking in memcgs that
> determine what shrinkers get run. See shrink_slab_memcg().
> 
> i.e. The SHRINK_EMPTY return value is a special hack for memcg
> shrinkers so it can track whether there are freeable objects in the
> cache externally to try to avoid calling into shrinkers where no
> work can be done.  Think about having hundreds of shrinkers and
> hundreds of memcgs...
> 
> Anyway, the tracking of the freeable bit is racy, so the
> SHRINK_EMPTY hack where it clears the bit and calls back into the
> shrinker is handling the case where objects were freed between the
> shrinker running and shrink_slab_memcg() clearing the freeable bit
> from the slab. Hence it has to call back into the shrinker again -
> if it gets anything other than SHRINK_EMPTY returned, then it will
> set the bit again.
> 

Yeah, I grokked most of that from the code. The current implementation
looks fine to me, but I could easily see how changes in the higher level
do_shrink_slab() caller(s) or lower level shrinker callbacks could
quietly break this in the future. IOW, once this code hits the tree any
shrinker across the kernel is free to try and defer slab reclaim work
for any reason.

> In reality, SHRINK_EMPTY and deferring work are mutually exclusive.
> Work only gets deferred when there's work that can be done and in
> that case SHRINK_EMPTY will not be returned - a value of "0 freed
> objects" will be returned when we defer work. So if the first call
> returns SHRINK_EMPTY, the "defer" state has not been touched and
> so doesn't require resetting to zero here.
> 

Yep. The high level semantics make sense, but note that the generic
superblock shrinker can now set ->will_defer true and return
SHRINK_EMPTY so that last bit about defer state not being touched is not
technically true.

> > Granted the deferred state likely hasn't
> > changed, but the fact that we'd call back into the count callback to set
> > it again implies the logic could be a bit more explicit, particularly if
> > this will eventually be used for more dynamic shrinker state that might
> > change call to call (i.e., object dirty state, etc.).
> > 
> > BTW, do we need to care about the ->nr_cached_objects() call from the
> > generic superblock shrinker (super_cache_scan())?
> 
> No, and we never had to because it is inside the superblock shrinker
> and the superblock shrinker does the GFP_NOFS context checks.
> 

Ok. Though tbh this topic has me wondering whether a shrink_control
boolean is the right approach here. Do you envision ->will_defer being
used for anything other than allocation context restrictions? If not,
perhaps we should do something like optionally set alloc flags required
for direct scanning in the struct shrinker itself and let the core
shrinker code decide when to defer to kswapd based on the shrink_control
flags and the current shrinker. That way an arbitrary shrinker can't
muck around with core behavior in unintended ways. Hm?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Dave Chinner Aug. 5, 2019, 11:43 p.m. UTC | #4

On Mon, Aug 05, 2019 at 01:42:26PM -0400, Brian Foster wrote:
> On Sun, Aug 04, 2019 at 11:49:30AM +1000, Dave Chinner wrote:
> > On Fri, Aug 02, 2019 at 11:27:09AM -0400, Brian Foster wrote:
> > > On Thu, Aug 01, 2019 at 12:17:29PM +1000, Dave Chinner wrote:
> > > >  };
> > > >  
> > > >  #define SHRINK_STOP (~0UL)
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 44df66a98f2a..ae3035fe94bc 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -541,6 +541,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> > > >  	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> > > >  				   freeable, delta, total_scan, priority);
> > > >  
> > > > +	/*
> > > > +	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> > > > +	 * defer the work to a context that can scan the cache.
> > > > +	 */
> > > > +	if (shrinkctl->will_defer)
> > > > +		goto done;
> > > > +
> > > 
> > > Who's responsible for clearing the flag? Perhaps we should do so here
> > > once it's acted upon since we don't call into the shrinker again?
> > 
> > Each shrinker invocation has its own shrink_control context - they
> > are not shared between shrinkers - the higher level is responsible
> > for setting up the control state of each individual shrinker
> > invocation...
> > 
> 
> Yes, but more specifically, it appears to me that each level is
> responsible for setting up control state managed by that level. E.g.,
> shrink_slab_memcg() initializes the unchanging state per iteration and
> do_shrink_slab() (re)sets the scan state prior to ->scan_objects().

do_shrink_slab() is responsible for iterating the scan in
shrinker->batch sizes, that's all it's doing there. We have to do
some accounting work from scan to scan. However, if ->will_defer is
set, we skip that entire loop, so it's largely irrelevant IMO.

> > > Granted the deferred state likely hasn't
> > > changed, but the fact that we'd call back into the count callback to set
> > > it again implies the logic could be a bit more explicit, particularly if
> > > this will eventually be used for more dynamic shrinker state that might
> > > change call to call (i.e., object dirty state, etc.).
> > > 
> > > BTW, do we need to care about the ->nr_cached_objects() call from the
> > > generic superblock shrinker (super_cache_scan())?
> > 
> > No, and we never had to because it is inside the superblock shrinker
> > and the superblock shrinker does the GFP_NOFS context checks.
> > 
> 
> Ok. Though tbh this topic has me wondering whether a shrink_control
> boolean is the right approach here. Do you envision ->will_defer being
> used for anything other than allocation context restrictions? If not,

Not at this point. If there are other control flags needed, we can
add them in future - I don't like the idea of having a single control
flag mean different things in different contexts.

> perhaps we should do something like optionally set alloc flags required
> for direct scanning in the struct shrinker itself and let the core
> shrinker code decide when to defer to kswapd based on the shrink_control
> flags and the current shrinker. That way an arbitrary shrinker can't
> muck around with core behavior in unintended ways. Hm?

Arbitrary shrinkers can't "muck about" with the core behaviour any
more than they already could with this code. If you want to screw up
the core reclaim by always returning SHRINK_STOP to ->scan_objects
instead of doing work, then there is nothing stopping you from doing
that right now. Formalising the work deferral into a flag in the
shrink_control doesn't really change that at all, and as such I
don't see any need for over-complicating the mechanism here....

Cheers,

Dave.

Brian Foster Aug. 6, 2019, 12:27 p.m. UTC | #5

On Tue, Aug 06, 2019 at 09:43:18AM +1000, Dave Chinner wrote:
> On Mon, Aug 05, 2019 at 01:42:26PM -0400, Brian Foster wrote:
> > On Sun, Aug 04, 2019 at 11:49:30AM +1000, Dave Chinner wrote:
> > > On Fri, Aug 02, 2019 at 11:27:09AM -0400, Brian Foster wrote:
> > > > On Thu, Aug 01, 2019 at 12:17:29PM +1000, Dave Chinner wrote:
> > > > >  };
> > > > >  
> > > > >  #define SHRINK_STOP (~0UL)
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index 44df66a98f2a..ae3035fe94bc 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -541,6 +541,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> > > > >  	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> > > > >  				   freeable, delta, total_scan, priority);
> > > > >  
> > > > > +	/*
> > > > > +	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> > > > > +	 * defer the work to a context that can scan the cache.
> > > > > +	 */
> > > > > +	if (shrinkctl->will_defer)
> > > > > +		goto done;
> > > > > +
> > > > 
> > > > Who's responsible for clearing the flag? Perhaps we should do so here
> > > > once it's acted upon since we don't call into the shrinker again?
> > > 
> > > Each shrinker invocation has its own shrink_control context - they
> > > are not shared between shrinkers - the higher level is responsible
> > > for setting up the control state of each individual shrinker
> > > invocation...
> > > 
> > 
> > Yes, but more specifically, it appears to me that each level is
> > responsible for setting up control state managed by that level. E.g.,
> > shrink_slab_memcg() initializes the unchanging state per iteration and
> > do_shrink_slab() (re)sets the scan state prior to ->scan_objects().
> 
> do_shrink_slab() is responsible for iterating the scan in
> shrinker->batch sizes, that's all it's doing there. We have to do
> some accounting work from scan to scan. However, if ->will_defer is
> set, we skip that entire loop, so it's largely irrelevant IMO.
> 

The point is very simply that there are scenarios where ->will_defer
might be true or might be false on do_shrink_slab() entry and I'm just
noting it as a potential landmine. It's not a bug in the current code
from what I can tell. I can't imagine why we wouldn't just reset the
flag prior to the ->count_objects() call, but alas I'm not a maintainer
of this code so I'll leave it to other reviewers/maintainers at this
point..
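
Illustrative sketch of the reset suggested above, as a userspace model only:
clearing the flag immediately before the count callback runs means a value
left behind by an earlier call can never leak into the next decision. The
struct, the callback type and MODEL_GFP_FS are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_GFP_FS	0x1u	/* hypothetical stand-in for __GFP_FS */

struct sc_model {
	unsigned int gfp_mask;
	bool will_defer;
};

typedef unsigned long (*count_fn)(struct sc_model *sc);

/* Core-side wrapper: reset the flag, then let the shrinker count. */
static unsigned long count_with_reset(struct sc_model *sc, count_fn count)
{
	assert(sc && count);
	sc->will_defer = false;	/* never trust state from a previous call */
	return count(sc);
}

/* A shrinker that defers in contexts without MODEL_GFP_FS. */
static unsigned long count_nofs_sensitive(struct sc_model *sc)
{
	if (!(sc->gfp_mask & MODEL_GFP_FS))
		sc->will_defer = true;
	return 8;
}

/* A shrinker that never touches the flag. */
static unsigned long count_never_defers(struct sc_model *sc)
{
	(void)sc;
	return 8;
}
```

Reusing one sc_model for both shrinkers shows the point: without the reset,
the second shrinker would inherit will_defer from the first.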

> > > > Granted the deferred state likely hasn't
> > > > changed, but the fact that we'd call back into the count callback to set
> > > > it again implies the logic could be a bit more explicit, particularly if
> > > > this will eventually be used for more dynamic shrinker state that might
> > > > change call to call (i.e., object dirty state, etc.).
> > > > 
> > > > BTW, do we need to care about the ->nr_cached_objects() call from the
> > > > generic superblock shrinker (super_cache_scan())?
> > > 
> > > No, and we never had to because it is inside the superblock shrinker
> > > and the superblock shrinker does the GFP_NOFS context checks.
> > > 
> > 
> > Ok. Though tbh this topic has me wondering whether a shrink_control
> > boolean is the right approach here. Do you envision ->will_defer being
> > used for anything other than allocation context restrictions? If not,
> 
> Not at this point. If there are other control flags needed, we can
> add them in future - I don't like the idea of having a single control
> flag mean different things in different contexts.
> 

I don't think we're talking about the same thing here..

> > perhaps we should do something like optionally set alloc flags required
> > for direct scanning in the struct shrinker itself and let the core
> > shrinker code decide when to defer to kswapd based on the shrink_control
> > flags and the current shrinker. That way an arbitrary shrinker can't
> > muck around with core behavior in unintended ways. Hm?
> 
> Arbitrary shrinkers can't "muck about" with the core behaviour any
> more than they already could with this code. If you want to screw up
> the core reclaim by always returning SHRINK_STOP to ->scan_objects
> instead of doing work, then there is nothing stopping you from doing
> that right now. Formalising the work deferral into a flag in the
> shrink_control doesn't really change that at all, and as such I
> don't see any need for over-complicating the mechanism here....
> 

If you add a generic "defer work" knob to the shrinker mechanism, but
only process it as an "allocation context" check, I expect it could be
easily misused. For example, some shrinkers may decide to set the
flag dynamically based on in-core state. This will work when called from
some contexts but not from others (unrelated to allocation context),
which is confusing. Therefore, what I'm saying is that if the only
current use case is to defer work from shrinkers that currently skip
work due to allocation context restraints, this might be better codified
with something like the appended (untested) example patch. This may or
may not be a preferable interface to the flag, but it's certainly not an
overcomplication...

Brian

--- 8< ---

diff --git a/fs/super.c b/fs/super.c
index 113c58f19425..4e05ed9d6154 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -69,13 +69,6 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
-	/*
-	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
-	 * to recurse into the FS that called us in clear_inode() and friends..
-	 */
-	if (!(sc->gfp_mask & __GFP_FS))
-		return SHRINK_STOP;
-
 	if (!trylock_super(sb))
 		return SHRINK_STOP;
 
@@ -264,6 +257,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	s->s_shrink.count_objects = super_cache_count;
 	s->s_shrink.batch = 1024;
 	s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
+	s->s_shrink.direct_mask = __GFP_FS;
 	if (prealloc_shrinker(&s->s_shrink))
 		goto fail;
 	if (list_lru_init_memcg(&s->s_dentry_lru, &s->s_shrink))
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 9443cafd1969..e94e4edf7f1e 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -75,6 +75,8 @@ struct shrinker {
 #endif
 	/* objs pending delete, per node */
 	atomic_long_t *nr_deferred;
+
+	gfp_t	direct_mask;
 };
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 44df66a98f2a..fb339399e26a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -541,6 +541,15 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
 				   freeable, delta, total_scan, priority);
 
+	/*
+	 * If the shrinker can't run (e.g. due to gfp_mask constraints), then
+	 * defer the work to a context that can scan the cache.
+	 */
+	if (shrinker->direct_mask &&
+	    ((shrinkctl->gfp_mask & shrinker->direct_mask) !=
+	     shrinker->direct_mask))
+		goto done;
+
 	/*
 	 * Normally, we should not scan less than batch_size objects in one
 	 * pass to avoid too frequent shrinker calls, but if the slab has less
@@ -575,6 +584,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		cond_resched();
 	}
 
+done:
 	if (next_deferred >= scanned)
 		next_deferred -= scanned;
 	else
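
Illustrative sketch of the mask test used in the example patch above, pulled
out as a standalone userspace predicate. The MODEL_* bit values and the
function name are assumptions for the example; only the shape of the
comparison comes from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_GFP_IO	0x1u	/* hypothetical stand-in for __GFP_IO */
#define MODEL_GFP_FS	0x2u	/* hypothetical stand-in for __GFP_FS */

static_assert(MODEL_GFP_IO != MODEL_GFP_FS, "model flags must be distinct bits");

/*
 * True when the caller's context lacks at least one bit the shrinker
 * requires for direct scanning. A zero direct_mask means the shrinker
 * has no context requirement and is never deferred by this check.
 */
static bool must_defer(unsigned int gfp_mask, unsigned int direct_mask)
{
	return direct_mask && (gfp_mask & direct_mask) != direct_mask;
}
```

Comparing against the full mask, rather than testing for any overlap, means
a shrinker that requires several bits is deferred unless all of them are
present in the caller's gfp_mask.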

Dave Chinner Aug. 6, 2019, 10:22 p.m. UTC | #6

On Tue, Aug 06, 2019 at 08:27:54AM -0400, Brian Foster wrote:
> If you add a generic "defer work" knob to the shrinker mechanism, but
> only process it as an "allocation context" check, I expect it could be
> easily misused. For example, some shrinkers may decide to set the
> flag dynamically based on in-core state.

Which is already the case. e.g. There are shrinkers that don't do
anything because a try-lock fails.  I haven't attempted to change
them, but they are a clear example of how even ->scan_objects to
->scan_objects the shrinker context can change.

> This will work when called from
> some contexts but not from others (unrelated to allocation context),
> which is confusing. Therefore, what I'm saying is that if the only
> current use case is to defer work from shrinkers that currently skip
> work due to allocation context restraints, this might be better codified
> with something like the appended (untested) example patch. This may or
> may not be a preferable interface to the flag, but it's certainly not an
> overcomplication...

I don't think this is the right way to go.

I want the filesystem shrinkers to become entirely non-blocking so
that we can dynamically decide on an object-by-object basis whether
we can reclaim the object in GFP_NOFS context.

That is, a clean XFS inode that requires no special cleanup can be
reclaimed even in GFP_NOFS context. The problem we have is that
dentry reclaim can drop the last reference to an inode, causing
inactivation and hence modification. However, if it's only going to
move to the inode LRU and not evict the inode, we can reclaim that
dentry. Similarly for inodes - if evicting the inode is not going to
block or modify the inode, we can reclaim the inode even under
GFP_NOFS constraints. And the same for XFS inodes - if it's clean
we can reclaim it, GFP_NOFS context or not.

IMO, that's the direction we need to be heading in, and in those
cases the "deferred work" tends towards a count of objects we could
not reclaim during the scan because they require blocking work to be
done. i.e. deferred work is a boolean now because the GFP_NOFS
decision is boolean, but it lays the groundwork for deferred work
to be integrated at a much finer-grained level in the shrinker
scanning routines in future...
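
Illustrative sketch of the object-by-object direction described above, as a
purely hypothetical userspace model (no such interface exists in this
series): clean objects are reclaimed regardless of context, and objects that
would need blocking work are counted as deferred rather than reclaimed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct obj_model {
	bool dirty;	/* reclaim would need blocking work */
	bool reclaimed;
};

struct scan_result_model {
	unsigned long freed;
	unsigned long deferred;	/* objects skipped because we may not block */
};

/* may_block == false models a GFP_NOFS caller. */
static struct scan_result_model model_scan(struct obj_model *objs, size_t n,
					   bool may_block)
{
	struct scan_result_model res = { 0, 0 };
	size_t i;

	assert(objs || n == 0);
	for (i = 0; i < n; i++) {
		if (objs[i].reclaimed)
			continue;
		if (!objs[i].dirty || may_block) {
			objs[i].reclaimed = true;
			res.freed++;
		} else {
			res.deferred++;
		}
	}
	return res;
}
```

With two clean and two dirty objects, a non-blocking pass frees the clean
pair and reports two deferred; a later blocking pass frees the rest.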

Cheers,

Dave.

Brian Foster Aug. 7, 2019, 11:13 a.m. UTC | #7

On Wed, Aug 07, 2019 at 08:22:20AM +1000, Dave Chinner wrote:
> On Tue, Aug 06, 2019 at 08:27:54AM -0400, Brian Foster wrote:
> > If you add a generic "defer work" knob to the shrinker mechanism, but
> > only process it as an "allocation context" check, I expect it could be
> > easily misused. For example, some shrinkers may decide to set the
> > flag dynamically based on in-core state.
> 
> Which is already the case. e.g. There are shrinkers that don't do
> anything because a try-lock fails.  I haven't attempted to change
> them, but they are a clear example of how even ->scan_objects to
> ->scan_objects the shrinker context can change.
> 

That's a similar point to what I'm trying to make wrt
->count_objects() and the new defer state..

> > This will work when called from
> > some contexts but not from others (unrelated to allocation context),
> > which is confusing. Therefore, what I'm saying is that if the only
> > current use case is to defer work from shrinkers that currently skip
> > work due to allocation context restraints, this might be better codified
> > with something like the appended (untested) example patch. This may or
> > may not be a preferable interface to the flag, but it's certainly not an
> > overcomplication...
> 
> I don't think this is the right way to go.
> 
> I want the filesystem shrinkers to become entirely non-blocking so
> that we can dynamically decide on an object-by-object basis whether
> we can reclaim the object in GFP_NOFS context.
> 

This is why I was asking about whether/how you envisioned the defer flag
looking in the future. Though I think this is somewhat orthogonal to the
discussion between having a bool or internal alloc mask set, because
both are of the same granularity and would need to change to operate on
a per-object basis.

> That is, a clean XFS inode that requires no special cleanup can be
> reclaimed even in GFP_NOFS context. The problem we have is that
> dentry reclaim can drop the last reference to an inode, causing
> inactivation and hence modification. However, if it's only going to
> move to the inode LRU and not evict the inode, we can reclaim that
> dentry. Similarly for inodes - if evicting the inode is not going to
> block or modify the inode, we can reclaim the inode even under
> GFP_NOFS constraints. And the same for XFS inodes - if it's clean
> we can reclaim it, GFP_NOFS context or not.
> 
> IMO, that's the direction we need to be heading in, and in those
> cases the "deferred work" tends towards a count of objects we could
> not reclaim during the scan because they require blocking work to be
> done. i.e. deferred work is a boolean now because the GFP_NOFS
> decision is boolean, but it lays the groundwork for deferred work
> to be integrated at a much finer-grained level in the shrinker
> scanning routines in future...
> 

Yeah, this sounds more like it warrants a ->nr_deferred field or some
such, which could ultimately replace either of the previously discussed
options for deferring the entire instance. BTW, ISTM we could use that
kind of interface now for exactly what this patch is trying to
accomplish by changing those shrinkers with allocation context
restrictions to just transfer the entire scan count to the deferred
count in ->scan_objects() instead of setting the flag. That's somewhat
less churn in the long run because we aren't shifting the defer logic
back and forth between the count and scan callbacks unnecessarily. IMO,
it's also a cleaner interface than both options above.
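
Illustrative sketch of the ->nr_deferred idea above, as a hypothetical
userspace model: instead of setting a flag in the count callback, the scan
callback itself hands the whole scan count back as deferred work when the
context does not allow scanning. The nr_deferred field in the control
structure is an assumption for the example and is not part of this patch.

```c
#include <assert.h>

#define MODEL_GFP_FS	0x1u	/* hypothetical stand-in for __GFP_FS */

struct sc_model {
	unsigned int gfp_mask;
	unsigned long nr_to_scan;
	unsigned long nr_deferred;	/* hypothetical: work handed back to the core */
};

/* Scan callback: either do the work or defer all of it. */
static unsigned long model_scan_objects(struct sc_model *sc,
					unsigned long *nr_objects)
{
	unsigned long freed;

	assert(sc && nr_objects);
	if (!(sc->gfp_mask & MODEL_GFP_FS)) {
		sc->nr_deferred = sc->nr_to_scan;	/* transfer the whole count */
		return 0;
	}

	freed = sc->nr_to_scan < *nr_objects ? sc->nr_to_scan : *nr_objects;
	*nr_objects -= freed;
	sc->nr_deferred = 0;
	return freed;
}
```

The same field could later carry a partial count, which is where this
interface and the per-object direction discussed earlier would meet.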

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
