[4/6] nfsd: filecache: introduce NFSD_FILE_RECENT

Message ID	20250207051701.3467505-5-neilb@suse.de (mailing list archive)
State	Changes Requested
Delegated to:	Chuck Lever
Headers	show Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.223.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BED0C14F9C4 for <linux-nfs@vger.kernel.org>; Fri, 7 Feb 2025 05:17:41 +0000 (UTC) From: NeilBrown <neilb@suse.de> To: Chuck Lever <chuck.lever@oracle.com>, Jeff Layton <jlayton@kernel.org> Cc: linux-nfs@vger.kernel.org, Olga Kornievskaia <okorniev@redhat.com>, Dai Ngo <Dai.Ngo@oracle.com>, Tom Talpey <tom@talpey.com>, Dave Chinner <david@fromorbit.com> Subject: [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT Date: Fri, 7 Feb 2025 16:15:14 +1100 Message-ID: <20250207051701.3467505-5-neilb@suse.de> In-Reply-To: <20250207051701.3467505-1-neilb@suse.de> References: <20250207051701.3467505-1-neilb@suse.de> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	nfsd: filecache: various fixes \| expand [0/6] nfsd: filecache: various fixes [1/6] nfsd: filecache: remove race handling. [2/6] nfsd: filecache: use nfsd_file_dispose_list() in nfsd_file_close_inode_sync() [3/6] nfsd: filecache: use list_lru_walk_node() in nfsd_file_gc() [4/6] nfsd: filecache: introduce NFSD_FILE_RECENT [5/6] nfsd: filecache: don't repeatedly add/remove files on the lru list [6/6] nfsd: filecache: drop the list_lru lock during lock gc scans

NeilBrown Feb. 7, 2025, 5:15 a.m. UTC

The filecache lru is walked in 2 circumstances for 2 different reasons.

1/ When called from the shrinker we want to discard the first few
   entries on the list, ignoring any with NFSD_FILE_REFERENCED set
   because they should really be at the end of the LRU as they have been
   referenced recently.  So those ones are ROTATED.

2/ When called from the nfsd_file_gc() timer function we want to discard
   anything that hasn't been used since before the previous call, and
   mark everything else as unused at this point in time.

Using the same flag for both of these can result in some unexpected
outcomes.  If the shrinker callback clears NFSD_FILE_REFERENCED then the
nfsd_file_gc() will think the file hasn't been used in a while, while
really it has.

I think it is easier to reason about the behaviour if we instead have
two flags.

 NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
     put it there when convenient"
 NFSD_FILE_RECENT means "this has been used recently - since the last
     run of nfsd_file_gc()

When either caller finds an NFSD_FILE_REFERENCED entry, that entry
should be moved to the end of the LRU and the flag cleared.  This can
safely happen at any time.  The actual order on the lru might not be
strictly least-recently-used, but that is normal for linux lrus.

The shrinker callback can ignore the "recent" flag.  If it ends up
freeing something that is "recent" that simply means that memory
pressure is sufficient to limit the acceptable cache age to less than
the nfsd_file_gc frequency.

The gc caller should primarily focus on NFSD_FILE_RECENT.  It should
free everything that doesn't have this flag set, and should clear the
flag on everything else.  When it clears the flag it is convenient to
clear the "REFERENCED" flag and move to the end of the LRU too.

With this, calls from the shrinker do not prematurely age files.  It
will focus only on freeing those that are least recently used.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfsd/filecache.c | 21 +++++++++++++++++++--
 fs/nfsd/filecache.h |  1 +
 fs/nfsd/trace.h     |  3 +++
 3 files changed, 23 insertions(+), 2 deletions(-)

Chuck Lever Feb. 7, 2025, 2:52 p.m. UTC | #1

On 2/7/25 12:15 AM, NeilBrown wrote:
> The filecache lru is walked in 2 circumstances for 2 different reasons.
> 
> 1/ When called from the shrinker we want to discard the first few
>    entries on the list, ignoring any with NFSD_FILE_REFERENCED set
>    because they should really be at the end of the LRU as they have been
>    referenced recently.  So those ones are ROTATED.
> 
> 2/ When called from the nfsd_file_gc() timer function we want to discard
>    anything that hasn't been used since before the previous call, and
>    mark everything else as unused at this point in time.
> 
> Using the same flag for both of these can result in some unexpected
> outcomes.  If the shrinker callback clears NFSD_FILE_REFERENCED then the
> nfsd_file_gc() will think the file hasn't been used in a while, while
> really it has.
> 
> I think it is easier to reason about the behaviour if we instead have
> two flags.
> 
>  NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
>      put it there when convenient"
>  NFSD_FILE_RECENT means "this has been used recently - since the last
>      run of nfsd_file_gc()
> 
> When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> should be moved to the end of the LRU and the flag cleared.  This can
> safely happen at any time.  The actual order on the lru might not be
> strictly least-recently-used, but that is normal for linux lrus.
> 
> The shrinker callback can ignore the "recent" flag.  If it ends up
> freeing something that is "recent" that simply means that memory
> pressure is sufficient to limit the acceptable cache age to less than
> the nfsd_file_gc frequency.
> 
> The gc caller should primarily focus on NFSD_FILE_RECENT.  It should
> free everything that doesn't have this flag set, and should clear the
> flag on everything else.  When it clears the flag it is convenient to
> clear the "REFERENCED" flag and move to the end of the LRU too.
> 
> With this, calls from the shrinker do not prematurely age files.  It
> will focus only on freeing those that are least recently used.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfsd/filecache.c | 21 +++++++++++++++++++--
>  fs/nfsd/filecache.h |  1 +
>  fs/nfsd/trace.h     |  3 +++
>  3 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 04588c03bdfe..9faf469354a5 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
>  		mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
>  }
>  
> -
>  static bool nfsd_file_lru_add(struct nfsd_file *nf)
>  {
>  	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> +	set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
>  	if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
>  		trace_nfsd_file_lru_add(nf);
>  		return true;
> @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
>  	return LRU_REMOVED;
>  }
>  
> +static enum lru_status
> +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
> +		 void *arg)
> +{
> +	struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> +
> +	if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
> +		/* "REFERENCED" really means "should be at the end of the LRU.
> +		 * As we are putting it there we can clear the flag
> +		 */
> +		clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> +		trace_nfsd_file_gc_aged(nf);
> +		return LRU_ROTATE;
> +	}
> +	return nfsd_file_lru_cb(item, lru, arg);
> +}
> +
>  static void
>  nfsd_file_gc(void)
>  {
> @@ -537,7 +554,7 @@ nfsd_file_gc(void)
>  
>  	for_each_node_state(nid, N_NORMAL_MEMORY) {
>  		unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> -		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> +		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
>  					  &dispose, &nr);
>  	}
>  	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> index d5db6b34ba30..de5b8aa7fcb0 100644
> --- a/fs/nfsd/filecache.h
> +++ b/fs/nfsd/filecache.h
> @@ -38,6 +38,7 @@ struct nfsd_file {
>  #define NFSD_FILE_PENDING	(1)
>  #define NFSD_FILE_REFERENCED	(2)
>  #define NFSD_FILE_GC		(3)
> +#define NFSD_FILE_RECENT	(4)
>  	unsigned long		nf_flags;
>  	refcount_t		nf_ref;
>  	unsigned char		nf_may;
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index ad2c0c432d08..9af723eeb2b0 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
>  		{ 1 << NFSD_FILE_HASHED,	"HASHED" },		\
>  		{ 1 << NFSD_FILE_PENDING,	"PENDING" },		\
>  		{ 1 << NFSD_FILE_REFERENCED,	"REFERENCED" },		\
> +		{ 1 << NFSD_FILE_RECENT,	"RECENT" },		\
>  		{ 1 << NFSD_FILE_GC,		"GC" })
>  
>  DECLARE_EVENT_CLASS(nfsd_file_class,
> @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
> +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
>  
>  DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
> @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name,				\
>  	TP_ARGS(removed, remaining))
>  
>  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
> +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
>  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
>  
>  TRACE_EVENT(nfsd_file_close,

The other patches in this series look like solid improvements. This one
could be as well, but it will take me some time to understand it.

I am generally in favor of replacing the logic that removes and adds
these items with a single atomic bitop, and I'm happy to see NFSD stick
with the use of an existing LRU facility while documenting its unique
requirements ("nfsd_file_gc_aged" and so on).

I would still prefer the backport to be lighter -- looks like the key
changes are 3/6 and 6/6. Is there any chance the series can be
reorganized to facilitate backporting? I have to ask, and the answer
might be "no", I realize.

NeilBrown Feb. 9, 2025, 11:23 p.m. UTC | #2

On Sat, 08 Feb 2025, Chuck Lever wrote:
> On 2/7/25 12:15 AM, NeilBrown wrote:
> > The filecache lru is walked in 2 circumstances for 2 different reasons.
> > 
> > 1/ When called from the shrinker we want to discard the first few
> >    entries on the list, ignoring any with NFSD_FILE_REFERENCED set
> >    because they should really be at the end of the LRU as they have been
> >    referenced recently.  So those ones are ROTATED.
> > 
> > 2/ When called from the nfsd_file_gc() timer function we want to discard
> >    anything that hasn't been used since before the previous call, and
> >    mark everything else as unused at this point in time.
> > 
> > Using the same flag for both of these can result in some unexpected
> > outcomes.  If the shrinker callback clears NFSD_FILE_REFERENCED then the
> > nfsd_file_gc() will think the file hasn't been used in a while, while
> > really it has.
> > 
> > I think it is easier to reason about the behaviour if we instead have
> > two flags.
> > 
> >  NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
> >      put it there when convenient"
> >  NFSD_FILE_RECENT means "this has been used recently - since the last
> >      run of nfsd_file_gc()
> > 
> > When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> > should be moved to the end of the LRU and the flag cleared.  This can
> > safely happen at any time.  The actual order on the lru might not be
> > strictly least-recently-used, but that is normal for linux lrus.
> > 
> > The shrinker callback can ignore the "recent" flag.  If it ends up
> > freeing something that is "recent" that simply means that memory
> > pressure is sufficient to limit the acceptable cache age to less than
> > the nfsd_file_gc frequency.
> > 
> > The gc caller should primarily focus on NFSD_FILE_RECENT.  It should
> > free everything that doesn't have this flag set, and should clear the
> > flag on everything else.  When it clears the flag it is convenient to
> > clear the "REFERENCED" flag and move to the end of the LRU too.
> > 
> > With this, calls from the shrinker do not prematurely age files.  It
> > will focus only on freeing those that are least recently used.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/nfsd/filecache.c | 21 +++++++++++++++++++--
> >  fs/nfsd/filecache.h |  1 +
> >  fs/nfsd/trace.h     |  3 +++
> >  3 files changed, 23 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index 04588c03bdfe..9faf469354a5 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> >  		mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> >  }
> >  
> > -
> >  static bool nfsd_file_lru_add(struct nfsd_file *nf)
> >  {
> >  	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > +	set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
> >  	if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
> >  		trace_nfsd_file_lru_add(nf);
> >  		return true;
> > @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> >  	return LRU_REMOVED;
> >  }
> >  
> > +static enum lru_status
> > +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
> > +		 void *arg)
> > +{
> > +	struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> > +
> > +	if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
> > +		/* "REFERENCED" really means "should be at the end of the LRU.
> > +		 * As we are putting it there we can clear the flag
> > +		 */
> > +		clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > +		trace_nfsd_file_gc_aged(nf);
> > +		return LRU_ROTATE;
> > +	}
> > +	return nfsd_file_lru_cb(item, lru, arg);
> > +}
> > +
> >  static void
> >  nfsd_file_gc(void)
> >  {
> > @@ -537,7 +554,7 @@ nfsd_file_gc(void)
> >  
> >  	for_each_node_state(nid, N_NORMAL_MEMORY) {
> >  		unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> > -		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> > +		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
> >  					  &dispose, &nr);
> >  	}
> >  	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> > index d5db6b34ba30..de5b8aa7fcb0 100644
> > --- a/fs/nfsd/filecache.h
> > +++ b/fs/nfsd/filecache.h
> > @@ -38,6 +38,7 @@ struct nfsd_file {
> >  #define NFSD_FILE_PENDING	(1)
> >  #define NFSD_FILE_REFERENCED	(2)
> >  #define NFSD_FILE_GC		(3)
> > +#define NFSD_FILE_RECENT	(4)
> >  	unsigned long		nf_flags;
> >  	refcount_t		nf_ref;
> >  	unsigned char		nf_may;
> > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > index ad2c0c432d08..9af723eeb2b0 100644
> > --- a/fs/nfsd/trace.h
> > +++ b/fs/nfsd/trace.h
> > @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
> >  		{ 1 << NFSD_FILE_HASHED,	"HASHED" },		\
> >  		{ 1 << NFSD_FILE_PENDING,	"PENDING" },		\
> >  		{ 1 << NFSD_FILE_REFERENCED,	"REFERENCED" },		\
> > +		{ 1 << NFSD_FILE_RECENT,	"RECENT" },		\
> >  		{ 1 << NFSD_FILE_GC,		"GC" })
> >  
> >  DECLARE_EVENT_CLASS(nfsd_file_class,
> > @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
> >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
> >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
> >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
> > +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
> >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
> >  
> >  DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
> > @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name,				\
> >  	TP_ARGS(removed, remaining))
> >  
> >  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
> > +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
> >  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
> >  
> >  TRACE_EVENT(nfsd_file_close,
> 
> The other patches in this series look like solid improvements. This one
> could be as well, but it will take me some time to understand it.
> 
> I am generally in favor of replacing the logic that removes and adds
> these items with a single atomic bitop, and I'm happy to see NFSD stick
> with the use of an existing LRU facility while documenting its unique
> requirements ("nfsd_file_gc_aged" and so on).
> 
> I would still prefer the backport to be lighter -- looks like the key
> changes are 3/6 and 6/6. Is there any chance the series can be
> reorganized to facilitate backporting? I have to ask, and the answer
> might be "no", I realize.

I'm going with "no".
To be honest, I was hoping that the complexity displayed here needed
to work around the assumptions of list_lru what don't match our needs
would be sufficient to convince you that list_lru isn't worth pursuing. 
I see that didn't work.

So I am no longer invested in this patch set.  You are welcome to use it
if you wish and to make any changes that you think are suitable, but I
don't think it is a good direction to go and will not be offering any
more code changes to support the use of list_lru here.

Thanks,
NeilBrown

Chuck Lever Feb. 10, 2025, 12:50 a.m. UTC | #3

On 2/9/25 6:23 PM, NeilBrown wrote:
> On Sat, 08 Feb 2025, Chuck Lever wrote:
>> On 2/7/25 12:15 AM, NeilBrown wrote:
>>> The filecache lru is walked in 2 circumstances for 2 different reasons.
>>>
>>> 1/ When called from the shrinker we want to discard the first few
>>>    entries on the list, ignoring any with NFSD_FILE_REFERENCED set
>>>    because they should really be at the end of the LRU as they have been
>>>    referenced recently.  So those ones are ROTATED.
>>>
>>> 2/ When called from the nfsd_file_gc() timer function we want to discard
>>>    anything that hasn't been used since before the previous call, and
>>>    mark everything else as unused at this point in time.
>>>
>>> Using the same flag for both of these can result in some unexpected
>>> outcomes.  If the shrinker callback clears NFSD_FILE_REFERENCED then the
>>> nfsd_file_gc() will think the file hasn't been used in a while, while
>>> really it has.
>>>
>>> I think it is easier to reason about the behaviour if we instead have
>>> two flags.
>>>
>>>  NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
>>>      put it there when convenient"
>>>  NFSD_FILE_RECENT means "this has been used recently - since the last
>>>      run of nfsd_file_gc()
>>>
>>> When either caller finds an NFSD_FILE_REFERENCED entry, that entry
>>> should be moved to the end of the LRU and the flag cleared.  This can
>>> safely happen at any time.  The actual order on the lru might not be
>>> strictly least-recently-used, but that is normal for linux lrus.
>>>
>>> The shrinker callback can ignore the "recent" flag.  If it ends up
>>> freeing something that is "recent" that simply means that memory
>>> pressure is sufficient to limit the acceptable cache age to less than
>>> the nfsd_file_gc frequency.
>>>
>>> The gc caller should primarily focus on NFSD_FILE_RECENT.  It should
>>> free everything that doesn't have this flag set, and should clear the
>>> flag on everything else.  When it clears the flag it is convenient to
>>> clear the "REFERENCED" flag and move to the end of the LRU too.
>>>
>>> With this, calls from the shrinker do not prematurely age files.  It
>>> will focus only on freeing those that are least recently used.
>>>
>>> Signed-off-by: NeilBrown <neilb@suse.de>
>>> ---
>>>  fs/nfsd/filecache.c | 21 +++++++++++++++++++--
>>>  fs/nfsd/filecache.h |  1 +
>>>  fs/nfsd/trace.h     |  3 +++
>>>  3 files changed, 23 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>>> index 04588c03bdfe..9faf469354a5 100644
>>> --- a/fs/nfsd/filecache.c
>>> +++ b/fs/nfsd/filecache.c
>>> @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
>>>  		mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
>>>  }
>>>  
>>> -
>>>  static bool nfsd_file_lru_add(struct nfsd_file *nf)
>>>  {
>>>  	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
>>> +	set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
>>>  	if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
>>>  		trace_nfsd_file_lru_add(nf);
>>>  		return true;
>>> @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
>>>  	return LRU_REMOVED;
>>>  }
>>>  
>>> +static enum lru_status
>>> +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
>>> +		 void *arg)
>>> +{
>>> +	struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
>>> +
>>> +	if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
>>> +		/* "REFERENCED" really means "should be at the end of the LRU.
>>> +		 * As we are putting it there we can clear the flag
>>> +		 */
>>> +		clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
>>> +		trace_nfsd_file_gc_aged(nf);
>>> +		return LRU_ROTATE;
>>> +	}
>>> +	return nfsd_file_lru_cb(item, lru, arg);
>>> +}
>>> +
>>>  static void
>>>  nfsd_file_gc(void)
>>>  {
>>> @@ -537,7 +554,7 @@ nfsd_file_gc(void)
>>>  
>>>  	for_each_node_state(nid, N_NORMAL_MEMORY) {
>>>  		unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
>>> -		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
>>> +		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
>>>  					  &dispose, &nr);
>>>  	}
>>>  	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
>>> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
>>> index d5db6b34ba30..de5b8aa7fcb0 100644
>>> --- a/fs/nfsd/filecache.h
>>> +++ b/fs/nfsd/filecache.h
>>> @@ -38,6 +38,7 @@ struct nfsd_file {
>>>  #define NFSD_FILE_PENDING	(1)
>>>  #define NFSD_FILE_REFERENCED	(2)
>>>  #define NFSD_FILE_GC		(3)
>>> +#define NFSD_FILE_RECENT	(4)
>>>  	unsigned long		nf_flags;
>>>  	refcount_t		nf_ref;
>>>  	unsigned char		nf_may;
>>> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
>>> index ad2c0c432d08..9af723eeb2b0 100644
>>> --- a/fs/nfsd/trace.h
>>> +++ b/fs/nfsd/trace.h
>>> @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
>>>  		{ 1 << NFSD_FILE_HASHED,	"HASHED" },		\
>>>  		{ 1 << NFSD_FILE_PENDING,	"PENDING" },		\
>>>  		{ 1 << NFSD_FILE_REFERENCED,	"REFERENCED" },		\
>>> +		{ 1 << NFSD_FILE_RECENT,	"RECENT" },		\
>>>  		{ 1 << NFSD_FILE_GC,		"GC" })
>>>  
>>>  DECLARE_EVENT_CLASS(nfsd_file_class,
>>> @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
>>>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
>>>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
>>>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
>>> +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
>>>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
>>>  
>>>  DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
>>> @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name,				\
>>>  	TP_ARGS(removed, remaining))
>>>  
>>>  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
>>> +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
>>>  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
>>>  
>>>  TRACE_EVENT(nfsd_file_close,
>>
>> The other patches in this series look like solid improvements. This one
>> could be as well, but it will take me some time to understand it.
>>
>> I am generally in favor of replacing the logic that removes and adds
>> these items with a single atomic bitop, and I'm happy to see NFSD stick
>> with the use of an existing LRU facility while documenting its unique
>> requirements ("nfsd_file_gc_aged" and so on).
>>
>> I would still prefer the backport to be lighter -- looks like the key
>> changes are 3/6 and 6/6. Is there any chance the series can be
>> reorganized to facilitate backporting? I have to ask, and the answer
>> might be "no", I realize.
> 
> I'm going with "no".
> To be honest, I was hoping that the complexity displayed here needed
> to work around the assumptions of list_lru what don't match our needs
> would be sufficient to convince you that list_lru isn't worth pursuing. 
> I see that didn't work.

Fair enough.


> So I am no longer invested in this patch set.  You are welcome to use it
> if you wish and to make any changes that you think are suitable, but I
> don't think it is a good direction to go and will not be offering any
> more code changes to support the use of list_lru here.

If I may observe, you haven't offered a compelling explanation of why an
imperfect fit between list_lru and the filecache adds more technical
debt than does the introduction of a bespoke LRU mechanism.

I'm open to that argument, but I need stronger rationale (or performance
data) to back it up. So far I can agree that the defect rate in this
area is somewhat abnormal, but that seems to be because we don't
understand how to use the list_lru API to its best advantage.

NeilBrown Feb. 10, 2025, 2:31 a.m. UTC | #4

On Mon, 10 Feb 2025, Chuck Lever wrote:
> On 2/9/25 6:23 PM, NeilBrown wrote:
> > On Sat, 08 Feb 2025, Chuck Lever wrote:
> >> On 2/7/25 12:15 AM, NeilBrown wrote:
> >>> The filecache lru is walked in 2 circumstances for 2 different reasons.
> >>>
> >>> 1/ When called from the shrinker we want to discard the first few
> >>>    entries on the list, ignoring any with NFSD_FILE_REFERENCED set
> >>>    because they should really be at the end of the LRU as they have been
> >>>    referenced recently.  So those ones are ROTATED.
> >>>
> >>> 2/ When called from the nfsd_file_gc() timer function we want to discard
> >>>    anything that hasn't been used since before the previous call, and
> >>>    mark everything else as unused at this point in time.
> >>>
> >>> Using the same flag for both of these can result in some unexpected
> >>> outcomes.  If the shrinker callback clears NFSD_FILE_REFERENCED then the
> >>> nfsd_file_gc() will think the file hasn't been used in a while, while
> >>> really it has.
> >>>
> >>> I think it is easier to reason about the behaviour if we instead have
> >>> two flags.
> >>>
> >>>  NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
> >>>      put it there when convenient"
> >>>  NFSD_FILE_RECENT means "this has been used recently - since the last
> >>>      run of nfsd_file_gc()
> >>>
> >>> When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> >>> should be moved to the end of the LRU and the flag cleared.  This can
> >>> safely happen at any time.  The actual order on the lru might not be
> >>> strictly least-recently-used, but that is normal for linux lrus.
> >>>
> >>> The shrinker callback can ignore the "recent" flag.  If it ends up
> >>> freeing something that is "recent" that simply means that memory
> >>> pressure is sufficient to limit the acceptable cache age to less than
> >>> the nfsd_file_gc frequency.
> >>>
> >>> The gc caller should primarily focus on NFSD_FILE_RECENT.  It should
> >>> free everything that doesn't have this flag set, and should clear the
> >>> flag on everything else.  When it clears the flag it is convenient to
> >>> clear the "REFERENCED" flag and move to the end of the LRU too.
> >>>
> >>> With this, calls from the shrinker do not prematurely age files.  It
> >>> will focus only on freeing those that are least recently used.
> >>>
> >>> Signed-off-by: NeilBrown <neilb@suse.de>
> >>> ---
> >>>  fs/nfsd/filecache.c | 21 +++++++++++++++++++--
> >>>  fs/nfsd/filecache.h |  1 +
> >>>  fs/nfsd/trace.h     |  3 +++
> >>>  3 files changed, 23 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> >>> index 04588c03bdfe..9faf469354a5 100644
> >>> --- a/fs/nfsd/filecache.c
> >>> +++ b/fs/nfsd/filecache.c
> >>> @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> >>>  		mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> >>>  }
> >>>  
> >>> -
> >>>  static bool nfsd_file_lru_add(struct nfsd_file *nf)
> >>>  {
> >>>  	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> >>> +	set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
> >>>  	if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
> >>>  		trace_nfsd_file_lru_add(nf);
> >>>  		return true;
> >>> @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> >>>  	return LRU_REMOVED;
> >>>  }
> >>>  
> >>> +static enum lru_status
> >>> +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
> >>> +		 void *arg)
> >>> +{
> >>> +	struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> >>> +
> >>> +	if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
> >>> +		/* "REFERENCED" really means "should be at the end of the LRU.
> >>> +		 * As we are putting it there we can clear the flag
> >>> +		 */
> >>> +		clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> >>> +		trace_nfsd_file_gc_aged(nf);
> >>> +		return LRU_ROTATE;
> >>> +	}
> >>> +	return nfsd_file_lru_cb(item, lru, arg);
> >>> +}
> >>> +
> >>>  static void
> >>>  nfsd_file_gc(void)
> >>>  {
> >>> @@ -537,7 +554,7 @@ nfsd_file_gc(void)
> >>>  
> >>>  	for_each_node_state(nid, N_NORMAL_MEMORY) {
> >>>  		unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> >>> -		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> >>> +		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
> >>>  					  &dispose, &nr);
> >>>  	}
> >>>  	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> >>> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> >>> index d5db6b34ba30..de5b8aa7fcb0 100644
> >>> --- a/fs/nfsd/filecache.h
> >>> +++ b/fs/nfsd/filecache.h
> >>> @@ -38,6 +38,7 @@ struct nfsd_file {
> >>>  #define NFSD_FILE_PENDING	(1)
> >>>  #define NFSD_FILE_REFERENCED	(2)
> >>>  #define NFSD_FILE_GC		(3)
> >>> +#define NFSD_FILE_RECENT	(4)
> >>>  	unsigned long		nf_flags;
> >>>  	refcount_t		nf_ref;
> >>>  	unsigned char		nf_may;
> >>> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> >>> index ad2c0c432d08..9af723eeb2b0 100644
> >>> --- a/fs/nfsd/trace.h
> >>> +++ b/fs/nfsd/trace.h
> >>> @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
> >>>  		{ 1 << NFSD_FILE_HASHED,	"HASHED" },		\
> >>>  		{ 1 << NFSD_FILE_PENDING,	"PENDING" },		\
> >>>  		{ 1 << NFSD_FILE_REFERENCED,	"REFERENCED" },		\
> >>> +		{ 1 << NFSD_FILE_RECENT,	"RECENT" },		\
> >>>  		{ 1 << NFSD_FILE_GC,		"GC" })
> >>>  
> >>>  DECLARE_EVENT_CLASS(nfsd_file_class,
> >>> @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
> >>>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
> >>>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
> >>>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
> >>> +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
> >>>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
> >>>  
> >>>  DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
> >>> @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name,				\
> >>>  	TP_ARGS(removed, remaining))
> >>>  
> >>>  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
> >>> +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
> >>>  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
> >>>  
> >>>  TRACE_EVENT(nfsd_file_close,
> >>
> >> The other patches in this series look like solid improvements. This one
> >> could be as well, but it will take me some time to understand it.
> >>
> >> I am generally in favor of replacing the logic that removes and adds
> >> these items with a single atomic bitop, and I'm happy to see NFSD stick
> >> with the use of an existing LRU facility while documenting its unique
> >> requirements ("nfsd_file_gc_aged" and so on).
> >>
> >> I would still prefer the backport to be lighter -- looks like the key
> >> changes are 3/6 and 6/6. Is there any chance the series can be
> >> reorganized to facilitate backporting? I have to ask, and the answer
> >> might be "no", I realize.
> > 
> > I'm going with "no".
> > To be honest, I was hoping that the complexity displayed here needed
> > to work around the assumptions of list_lru what don't match our needs
> > would be sufficient to convince you that list_lru isn't worth pursuing. 
> > I see that didn't work.
> 
> Fair enough.
> 
> 
> > So I am no longer invested in this patch set.  You are welcome to use it
> > if you wish and to make any changes that you think are suitable, but I
> > don't think it is a good direction to go and will not be offering any
> > more code changes to support the use of list_lru here.
> 
> If I may observe, you haven't offered a compelling explanation of why an
> imperfect fit between list_lru and the filecache adds more technical
> debt than does the introduction of a bespoke LRU mechanism.
> 
> I'm open to that argument, but I need stronger rationale (or performance
> data) to back it up. So far I can agree that the defect rate in this
> area is somewhat abnormal, but that seems to be because we don't
> understand how to use the list_lru API to its best advantage.
> 

I would characterise the cause of the defect rate differently.
I would say it is because we are treating this as an lru-style problem
when it isn't an lru-style problem.  list_lru is great for lrus.  That
isn't what we have.

What we have is a desire to keep files open between consecutive IO
requests without any clear indication of when we have seen the last in a
series of IO requests.  So we make a policy decision "keep files open
until there have been no IOs for 2 seconds - then close them".
This is a good and sensible policy that nothing to do with the "LRU"
concept. 

We implement this policy by keeping all unused files on a list, set a
flag every time the file is used, clearing the flag on a timer tick
(every 2 seconds) and closing anything which still has the flag cleared
2 seconds later.

Still nothing in this description that is at all related to LRU
concepts.

Now we decide that it would be good the add a shinker - fair enough as
we don't *need* these to remain.  How should the shrinker choose files
to close?  It probably doesn't matter beyond avoiding files that still
have the not-timed-out flag set.

But we try to also impose an LRU disciple over the list, and we use
list_lru.
The interfaces for list_lru() are well documented but the intent is
not.  Most users of list_lru (gfs2/quota might be an exception) only
explicitly delete things from the lru when it is time to discard them
completely.  They rely on the shrinker to detect things that are in use
again, and to remove them.  And possibly to detect things that have been
referenced and to rotate them.  But if the shrinker doesn't run because
there isn't much memory pressure they are just left alone.

This is what list_lru is optimised for - for shrinker driven scanning
which skips or removes or rotates things that can't or shouldn't
be freed, and frees others.  You would expect to normally only scan a
small fraction of the list, because realistically you want to keep most
of them.

For filecache we don't want to keep them very long.  So I think it
matters a lot less what we choose for shrinking.  I'm tempted to suggest
we don't bother with the shrinker.  Old files will be discarded soon
anyway if they aren't used, and slowness in memory allocation (due to
any memory pressure) will naturally slow down the addition of new files
to the cache.  So the cache will shrink naturally.

I'm not 100% certain of that, but I do think that the needs of the
shrinker should not dominate the design as they currently do.

Note that maybe we *don't* need to close files so quickly.  Maybe we
could discard the whole timer thing, and then it would make sense to use
list_lru().  What is the cost of keeping them open?

All I can think of is that it affects unlink.  An unlinked file won't be
removed while there is a reference to the inode.  Maybe we should
address that by taking a lease on the file while it is in the
filecache??  When the lease is broken, we discard the file from the
cache. 

If that could work (which might involve creating a new internal lease
type that is only broken on unlink), then we could remove the timeout
and leave files in the cache indefinitely.  Then it would make perfect
sense to use list_lru() because the problem would start to look exactly
like an LRU problem.  But I don't think that is what we have today.

Thanks,
NeilBrown

Jeff Layton Feb. 10, 2025, 2:01 p.m. UTC | #5

On Fri, 2025-02-07 at 16:15 +1100, NeilBrown wrote:
> The filecache lru is walked in 2 circumstances for 2 different reasons.
> 
> 1/ When called from the shrinker we want to discard the first few
>    entries on the list, ignoring any with NFSD_FILE_REFERENCED set
>    because they should really be at the end of the LRU as they have been
>    referenced recently.  So those ones are ROTATED.
> 
> 2/ When called from the nfsd_file_gc() timer function we want to discard
>    anything that hasn't been used since before the previous call, and
>    mark everything else as unused at this point in time.
> 
> Using the same flag for both of these can result in some unexpected
> outcomes.  If the shrinker callback clears NFSD_FILE_REFERENCED then the
> nfsd_file_gc() will think the file hasn't been used in a while, while
> really it has.
> 
> I think it is easier to reason about the behaviour if we instead have
> two flags.
> 
>  NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
>      put it there when convenient"
>  NFSD_FILE_RECENT means "this has been used recently - since the last
>      run of nfsd_file_gc()
> 
> When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> should be moved to the end of the LRU and the flag cleared.  This can
> safely happen at any time.  The actual order on the lru might not be
> strictly least-recently-used, but that is normal for linux lrus.
> 
> The shrinker callback can ignore the "recent" flag.  If it ends up
> freeing something that is "recent" that simply means that memory
> pressure is sufficient to limit the acceptable cache age to less than
> the nfsd_file_gc frequency.
> 
> The gc caller should primarily focus on NFSD_FILE_RECENT.  It should
> free everything that doesn't have this flag set, and should clear the
> flag on everything else.  When it clears the flag it is convenient to
> clear the "REFERENCED" flag and move to the end of the LRU too.
> 
> With this, calls from the shrinker do not prematurely age files.  It
> will focus only on freeing those that are least recently used.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfsd/filecache.c | 21 +++++++++++++++++++--
>  fs/nfsd/filecache.h |  1 +
>  fs/nfsd/trace.h     |  3 +++
>  3 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 04588c03bdfe..9faf469354a5 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
>  		mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
>  }
>  
> -
>  static bool nfsd_file_lru_add(struct nfsd_file *nf)
>  {
>  	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> +	set_bit(NFSD_FILE_RECENT, &nf->nf_flags);

Technically, I don't think you need the REFERENCED bit at all. This is
the only place it's set, and below this is calling list_lru_add_obj().
That returns false if the object was already on a per-node LRU.

Instead of that, you could add a list_lru helper that will rotate the
object to the end of its nodelist if it's already on one. OTOH, that
might mean more cross NUMA-node accesses to the spinlocks than we get
by using a flag and doing this at GC time.

>  	if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
>  		trace_nfsd_file_lru_add(nf);
>  		return true;
> @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
>  	return LRU_REMOVED;
>  }
>  
> +static enum lru_status
> +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
> +		 void *arg)
> +{
> +	struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> +
> +	if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
> +		/* "REFERENCED" really means "should be at the end of the LRU.
> +		 * As we are putting it there we can clear the flag
> +		 */
> +		clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> +		trace_nfsd_file_gc_aged(nf);
> +		return LRU_ROTATE;
> +	}
> +	return nfsd_file_lru_cb(item, lru, arg);
> +}
> +
>  static void
>  nfsd_file_gc(void)
>  {
> @@ -537,7 +554,7 @@ nfsd_file_gc(void)
>  
>  	for_each_node_state(nid, N_NORMAL_MEMORY) {
>  		unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> -		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> +		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
>  					  &dispose, &nr);
>  	}
>  	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> index d5db6b34ba30..de5b8aa7fcb0 100644
> --- a/fs/nfsd/filecache.h
> +++ b/fs/nfsd/filecache.h
> @@ -38,6 +38,7 @@ struct nfsd_file {
>  #define NFSD_FILE_PENDING	(1)
>  #define NFSD_FILE_REFERENCED	(2)
>  #define NFSD_FILE_GC		(3)
> +#define NFSD_FILE_RECENT	(4)
>  	unsigned long		nf_flags;
>  	refcount_t		nf_ref;
>  	unsigned char		nf_may;
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index ad2c0c432d08..9af723eeb2b0 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
>  		{ 1 << NFSD_FILE_HASHED,	"HASHED" },		\
>  		{ 1 << NFSD_FILE_PENDING,	"PENDING" },		\
>  		{ 1 << NFSD_FILE_REFERENCED,	"REFERENCED" },		\
> +		{ 1 << NFSD_FILE_RECENT,	"RECENT" },		\
>  		{ 1 << NFSD_FILE_GC,		"GC" })
>  
>  DECLARE_EVENT_CLASS(nfsd_file_class,
> @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
> +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
>  
>  DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
> @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name,				\
>  	TP_ARGS(removed, remaining))
>  
>  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
> +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
>  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
>  
>  TRACE_EVENT(nfsd_file_close,

Reviewed-by: Jeff Layton <jlayton@kernel.org>

Jeff Layton Feb. 10, 2025, 2:25 p.m. UTC | #6

On Mon, 2025-02-10 at 13:31 +1100, NeilBrown wrote:
> On Mon, 10 Feb 2025, Chuck Lever wrote:
> > On 2/9/25 6:23 PM, NeilBrown wrote:
> > > On Sat, 08 Feb 2025, Chuck Lever wrote:
> > > > On 2/7/25 12:15 AM, NeilBrown wrote:
> > > > > The filecache lru is walked in 2 circumstances for 2 different reasons.
> > > > > 
> > > > > 1/ When called from the shrinker we want to discard the first few
> > > > >    entries on the list, ignoring any with NFSD_FILE_REFERENCED set
> > > > >    because they should really be at the end of the LRU as they have been
> > > > >    referenced recently.  So those ones are ROTATED.
> > > > > 
> > > > > 2/ When called from the nfsd_file_gc() timer function we want to discard
> > > > >    anything that hasn't been used since before the previous call, and
> > > > >    mark everything else as unused at this point in time.
> > > > > 
> > > > > Using the same flag for both of these can result in some unexpected
> > > > > outcomes.  If the shrinker callback clears NFSD_FILE_REFERENCED then the
> > > > > nfsd_file_gc() will think the file hasn't been used in a while, while
> > > > > really it has.
> > > > > 
> > > > > I think it is easier to reason about the behaviour if we instead have
> > > > > two flags.
> > > > > 
> > > > >  NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
> > > > >      put it there when convenient"
> > > > >  NFSD_FILE_RECENT means "this has been used recently - since the last
> > > > >      run of nfsd_file_gc()
> > > > > 
> > > > > When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> > > > > should be moved to the end of the LRU and the flag cleared.  This can
> > > > > safely happen at any time.  The actual order on the lru might not be
> > > > > strictly least-recently-used, but that is normal for linux lrus.
> > > > > 
> > > > > The shrinker callback can ignore the "recent" flag.  If it ends up
> > > > > freeing something that is "recent" that simply means that memory
> > > > > pressure is sufficient to limit the acceptable cache age to less than
> > > > > the nfsd_file_gc frequency.
> > > > > 
> > > > > The gc caller should primarily focus on NFSD_FILE_RECENT.  It should
> > > > > free everything that doesn't have this flag set, and should clear the
> > > > > flag on everything else.  When it clears the flag it is convenient to
> > > > > clear the "REFERENCED" flag and move to the end of the LRU too.
> > > > > 
> > > > > With this, calls from the shrinker do not prematurely age files.  It
> > > > > will focus only on freeing those that are least recently used.
> > > > > 
> > > > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > > > ---
> > > > >  fs/nfsd/filecache.c | 21 +++++++++++++++++++--
> > > > >  fs/nfsd/filecache.h |  1 +
> > > > >  fs/nfsd/trace.h     |  3 +++
> > > > >  3 files changed, 23 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > > > index 04588c03bdfe..9faf469354a5 100644
> > > > > --- a/fs/nfsd/filecache.c
> > > > > +++ b/fs/nfsd/filecache.c
> > > > > @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> > > > >  		mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> > > > >  }
> > > > >  
> > > > > -
> > > > >  static bool nfsd_file_lru_add(struct nfsd_file *nf)
> > > > >  {
> > > > >  	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > > > +	set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
> > > > >  	if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
> > > > >  		trace_nfsd_file_lru_add(nf);
> > > > >  		return true;
> > > > > @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> > > > >  	return LRU_REMOVED;
> > > > >  }
> > > > >  
> > > > > +static enum lru_status
> > > > > +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
> > > > > +		 void *arg)
> > > > > +{
> > > > > +	struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> > > > > +
> > > > > +	if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
> > > > > +		/* "REFERENCED" really means "should be at the end of the LRU.
> > > > > +		 * As we are putting it there we can clear the flag
> > > > > +		 */
> > > > > +		clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > > > +		trace_nfsd_file_gc_aged(nf);
> > > > > +		return LRU_ROTATE;
> > > > > +	}
> > > > > +	return nfsd_file_lru_cb(item, lru, arg);
> > > > > +}
> > > > > +
> > > > >  static void
> > > > >  nfsd_file_gc(void)
> > > > >  {
> > > > > @@ -537,7 +554,7 @@ nfsd_file_gc(void)
> > > > >  
> > > > >  	for_each_node_state(nid, N_NORMAL_MEMORY) {
> > > > >  		unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> > > > > -		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> > > > > +		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
> > > > >  					  &dispose, &nr);
> > > > >  	}
> > > > >  	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> > > > > index d5db6b34ba30..de5b8aa7fcb0 100644
> > > > > --- a/fs/nfsd/filecache.h
> > > > > +++ b/fs/nfsd/filecache.h
> > > > > @@ -38,6 +38,7 @@ struct nfsd_file {
> > > > >  #define NFSD_FILE_PENDING	(1)
> > > > >  #define NFSD_FILE_REFERENCED	(2)
> > > > >  #define NFSD_FILE_GC		(3)
> > > > > +#define NFSD_FILE_RECENT	(4)
> > > > >  	unsigned long		nf_flags;
> > > > >  	refcount_t		nf_ref;
> > > > >  	unsigned char		nf_may;
> > > > > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > > > > index ad2c0c432d08..9af723eeb2b0 100644
> > > > > --- a/fs/nfsd/trace.h
> > > > > +++ b/fs/nfsd/trace.h
> > > > > @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
> > > > >  		{ 1 << NFSD_FILE_HASHED,	"HASHED" },		\
> > > > >  		{ 1 << NFSD_FILE_PENDING,	"PENDING" },		\
> > > > >  		{ 1 << NFSD_FILE_REFERENCED,	"REFERENCED" },		\
> > > > > +		{ 1 << NFSD_FILE_RECENT,	"RECENT" },		\
> > > > >  		{ 1 << NFSD_FILE_GC,		"GC" })
> > > > >  
> > > > >  DECLARE_EVENT_CLASS(nfsd_file_class,
> > > > > @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
> > > > >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
> > > > >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
> > > > >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
> > > > > +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
> > > > >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
> > > > >  
> > > > >  DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
> > > > > @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name,				\
> > > > >  	TP_ARGS(removed, remaining))
> > > > >  
> > > > >  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
> > > > > +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
> > > > >  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
> > > > >  
> > > > >  TRACE_EVENT(nfsd_file_close,
> > > > 
> > > > The other patches in this series look like solid improvements. This one
> > > > could be as well, but it will take me some time to understand it.
> > > > 
> > > > I am generally in favor of replacing the logic that removes and adds
> > > > these items with a single atomic bitop, and I'm happy to see NFSD stick
> > > > with the use of an existing LRU facility while documenting its unique
> > > > requirements ("nfsd_file_gc_aged" and so on).
> > > > 
> > > > I would still prefer the backport to be lighter -- looks like the key
> > > > changes are 3/6 and 6/6. Is there any chance the series can be
> > > > reorganized to facilitate backporting? I have to ask, and the answer
> > > > might be "no", I realize.
> > > 
> > > I'm going with "no".
> > > To be honest, I was hoping that the complexity displayed here needed
> > > to work around the assumptions of list_lru what don't match our needs
> > > would be sufficient to convince you that list_lru isn't worth pursuing. 
> > > I see that didn't work.
> > 
> > Fair enough.
> > 
> > 
> > > So I am no longer invested in this patch set.  You are welcome to use it
> > > if you wish and to make any changes that you think are suitable, but I
> > > don't think it is a good direction to go and will not be offering any
> > > more code changes to support the use of list_lru here.
> > 
> > If I may observe, you haven't offered a compelling explanation of why an
> > imperfect fit between list_lru and the filecache adds more technical
> > debt than does the introduction of a bespoke LRU mechanism.
> > 
> > I'm open to that argument, but I need stronger rationale (or performance
> > data) to back it up. So far I can agree that the defect rate in this
> > area is somewhat abnormal, but that seems to be because we don't
> > understand how to use the list_lru API to its best advantage.
> > 
> 
> I would characterise the cause of the defect rate differently.
> I would say it is because we are treating this as an lru-style problem
> when it isn't an lru-style problem.  list_lru is great for lrus.  That
> isn't what we have.
> 
> What we have is a desire to keep files open between consecutive IO
> requests without any clear indication of when we have seen the last in a
> series of IO requests.  So we make a policy decision "keep files open
> until there have been no IOs for 2 seconds - then close them".
> This is a good and sensible policy that nothing to do with the "LRU"
> concept. 
> 
> We implement this policy by keeping all unused files on a list, set a
> flag every time the file is used, clearing the flag on a timer tick
> (every 2 seconds) and closing anything which still has the flag cleared
> 2 seconds later.
> 
> Still nothing in this description that is at all related to LRU
> concepts.
> 
> Now we decide that it would be good the add a shinker - fair enough as
> we don't *need* these to remain.  How should the shrinker choose files
> to close?  It probably doesn't matter beyond avoiding files that still
> have the not-timed-out flag set.
> 
> But we try to also impose an LRU disciple over the list, and we use
> list_lru.
> The interfaces for list_lru() are well documented but the intent is
> not.  Most users of list_lru (gfs2/quota might be an exception) only
> explicitly delete things from the lru when it is time to discard them
> completely.  They rely on the shrinker to detect things that are in use
> again, and to remove them.  And possibly to detect things that have been
> referenced and to rotate them.  But if the shrinker doesn't run because
> there isn't much memory pressure they are just left alone.
> 
> This is what list_lru is optimised for - for shrinker driven scanning
> which skips or removes or rotates things that can't or shouldn't
> be freed, and frees others.  You would expect to normally only scan a
> small fraction of the list, because realistically you want to keep most
> of them.
> 
> For filecache we don't want to keep them very long.  So I think it
> matters a lot less what we choose for shrinking.  I'm tempted to suggest
> we don't bother with the shrinker.  Old files will be discarded soon
> anyway if they aren't used, and slowness in memory allocation (due to
> any memory pressure) will naturally slow down the addition of new files
> to the cache.  So the cache will shrink naturally.
> 
> I'm not 100% certain of that, but I do think that the needs of the
> shrinker should not dominate the design as they currently do.
> 
> Note that maybe we *don't* need to close files so quickly.  Maybe we
> could discard the whole timer thing, and then it would make sense to use
> list_lru().  What is the cost of keeping them open?
> 
> All I can think of is that it affects unlink.  An unlinked file won't be
> removed while there is a reference to the inode.  Maybe we should
> address that by taking a lease on the file while it is in the
> filecache??  When the lease is broken, we discard the file from the
> cache. 


It may also affect other applications trying to take out leases. The
filecache has the  nfsd_file_lease_notifier that tells it when someone
is trying to take out a lease on a file. That happens then it will try
to close the file first.

>
> If that could work (which might involve creating a new internal lease
> type that is only broken on unlink), then we could remove the timeout
> and leave files in the cache indefinitely.  Then it would make perfect
> sense to use list_lru() because the problem would start to look exactly
> like an LRU problem.  But I don't think that is what we have today.
> 

The filecache already sets a fsnotify_mark on the inode to watch for
its i_nlink to go to 0, and then removes it from the cache when that
happens. I think we could keep these files open for quite a bit longer
if we chose to do so.

One thing that Chuck has brought up a few times is that maybe we should
consider making v4 not use the filecache at all. If that simplifies
things then that might be a good thing to consider as well.

Chuck Lever Feb. 10, 2025, 2:26 p.m. UTC | #7

On 2/9/25 9:31 PM, NeilBrown wrote:
> On Mon, 10 Feb 2025, Chuck Lever wrote:
>> On 2/9/25 6:23 PM, NeilBrown wrote:
>>> On Sat, 08 Feb 2025, Chuck Lever wrote:
>>>> On 2/7/25 12:15 AM, NeilBrown wrote:
>>>>> The filecache lru is walked in 2 circumstances for 2 different reasons.
>>>>>
>>>>> 1/ When called from the shrinker we want to discard the first few
>>>>>    entries on the list, ignoring any with NFSD_FILE_REFERENCED set
>>>>>    because they should really be at the end of the LRU as they have been
>>>>>    referenced recently.  So those ones are ROTATED.
>>>>>
>>>>> 2/ When called from the nfsd_file_gc() timer function we want to discard
>>>>>    anything that hasn't been used since before the previous call, and
>>>>>    mark everything else as unused at this point in time.
>>>>>
>>>>> Using the same flag for both of these can result in some unexpected
>>>>> outcomes.  If the shrinker callback clears NFSD_FILE_REFERENCED then the
>>>>> nfsd_file_gc() will think the file hasn't been used in a while, while
>>>>> really it has.
>>>>>
>>>>> I think it is easier to reason about the behaviour if we instead have
>>>>> two flags.
>>>>>
>>>>>  NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
>>>>>      put it there when convenient"
>>>>>  NFSD_FILE_RECENT means "this has been used recently - since the last
>>>>>      run of nfsd_file_gc()
>>>>>
>>>>> When either caller finds an NFSD_FILE_REFERENCED entry, that entry
>>>>> should be moved to the end of the LRU and the flag cleared.  This can
>>>>> safely happen at any time.  The actual order on the lru might not be
>>>>> strictly least-recently-used, but that is normal for linux lrus.
>>>>>
>>>>> The shrinker callback can ignore the "recent" flag.  If it ends up
>>>>> freeing something that is "recent" that simply means that memory
>>>>> pressure is sufficient to limit the acceptable cache age to less than
>>>>> the nfsd_file_gc frequency.
>>>>>
>>>>> The gc caller should primarily focus on NFSD_FILE_RECENT.  It should
>>>>> free everything that doesn't have this flag set, and should clear the
>>>>> flag on everything else.  When it clears the flag it is convenient to
>>>>> clear the "REFERENCED" flag and move to the end of the LRU too.
>>>>>
>>>>> With this, calls from the shrinker do not prematurely age files.  It
>>>>> will focus only on freeing those that are least recently used.
>>>>>
>>>>> Signed-off-by: NeilBrown <neilb@suse.de>
>>>>> ---
>>>>>  fs/nfsd/filecache.c | 21 +++++++++++++++++++--
>>>>>  fs/nfsd/filecache.h |  1 +
>>>>>  fs/nfsd/trace.h     |  3 +++
>>>>>  3 files changed, 23 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>>>>> index 04588c03bdfe..9faf469354a5 100644
>>>>> --- a/fs/nfsd/filecache.c
>>>>> +++ b/fs/nfsd/filecache.c
>>>>> @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
>>>>>  		mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
>>>>>  }
>>>>>  
>>>>> -
>>>>>  static bool nfsd_file_lru_add(struct nfsd_file *nf)
>>>>>  {
>>>>>  	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
>>>>> +	set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
>>>>>  	if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
>>>>>  		trace_nfsd_file_lru_add(nf);
>>>>>  		return true;
>>>>> @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
>>>>>  	return LRU_REMOVED;
>>>>>  }
>>>>>  
>>>>> +static enum lru_status
>>>>> +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
>>>>> +		 void *arg)
>>>>> +{
>>>>> +	struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
>>>>> +
>>>>> +	if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
>>>>> +		/* "REFERENCED" really means "should be at the end of the LRU.
>>>>> +		 * As we are putting it there we can clear the flag
>>>>> +		 */
>>>>> +		clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
>>>>> +		trace_nfsd_file_gc_aged(nf);
>>>>> +		return LRU_ROTATE;
>>>>> +	}
>>>>> +	return nfsd_file_lru_cb(item, lru, arg);
>>>>> +}
>>>>> +
>>>>>  static void
>>>>>  nfsd_file_gc(void)
>>>>>  {
>>>>> @@ -537,7 +554,7 @@ nfsd_file_gc(void)
>>>>>  
>>>>>  	for_each_node_state(nid, N_NORMAL_MEMORY) {
>>>>>  		unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
>>>>> -		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
>>>>> +		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
>>>>>  					  &dispose, &nr);
>>>>>  	}
>>>>>  	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
>>>>> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
>>>>> index d5db6b34ba30..de5b8aa7fcb0 100644
>>>>> --- a/fs/nfsd/filecache.h
>>>>> +++ b/fs/nfsd/filecache.h
>>>>> @@ -38,6 +38,7 @@ struct nfsd_file {
>>>>>  #define NFSD_FILE_PENDING	(1)
>>>>>  #define NFSD_FILE_REFERENCED	(2)
>>>>>  #define NFSD_FILE_GC		(3)
>>>>> +#define NFSD_FILE_RECENT	(4)
>>>>>  	unsigned long		nf_flags;
>>>>>  	refcount_t		nf_ref;
>>>>>  	unsigned char		nf_may;
>>>>> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
>>>>> index ad2c0c432d08..9af723eeb2b0 100644
>>>>> --- a/fs/nfsd/trace.h
>>>>> +++ b/fs/nfsd/trace.h
>>>>> @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
>>>>>  		{ 1 << NFSD_FILE_HASHED,	"HASHED" },		\
>>>>>  		{ 1 << NFSD_FILE_PENDING,	"PENDING" },		\
>>>>>  		{ 1 << NFSD_FILE_REFERENCED,	"REFERENCED" },		\
>>>>> +		{ 1 << NFSD_FILE_RECENT,	"RECENT" },		\
>>>>>  		{ 1 << NFSD_FILE_GC,		"GC" })
>>>>>  
>>>>>  DECLARE_EVENT_CLASS(nfsd_file_class,
>>>>> @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
>>>>>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
>>>>>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
>>>>>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
>>>>> +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
>>>>>  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
>>>>>  
>>>>>  DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
>>>>> @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name,				\
>>>>>  	TP_ARGS(removed, remaining))
>>>>>  
>>>>>  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
>>>>> +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
>>>>>  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
>>>>>  
>>>>>  TRACE_EVENT(nfsd_file_close,
>>>>
>>>> The other patches in this series look like solid improvements. This one
>>>> could be as well, but it will take me some time to understand it.
>>>>
>>>> I am generally in favor of replacing the logic that removes and adds
>>>> these items with a single atomic bitop, and I'm happy to see NFSD stick
>>>> with the use of an existing LRU facility while documenting its unique
>>>> requirements ("nfsd_file_gc_aged" and so on).
>>>>
>>>> I would still prefer the backport to be lighter -- looks like the key
>>>> changes are 3/6 and 6/6. Is there any chance the series can be
>>>> reorganized to facilitate backporting? I have to ask, and the answer
>>>> might be "no", I realize.
>>>
>>> I'm going with "no".
>>> To be honest, I was hoping that the complexity displayed here needed
>>> to work around the assumptions of list_lru what don't match our needs
>>> would be sufficient to convince you that list_lru isn't worth pursuing. 
>>> I see that didn't work.
>>
>> Fair enough.
>>
>>
>>> So I am no longer invested in this patch set.  You are welcome to use it
>>> if you wish and to make any changes that you think are suitable, but I
>>> don't think it is a good direction to go and will not be offering any
>>> more code changes to support the use of list_lru here.
>>
>> If I may observe, you haven't offered a compelling explanation of why an
>> imperfect fit between list_lru and the filecache adds more technical
>> debt than does the introduction of a bespoke LRU mechanism.
>>
>> I'm open to that argument, but I need stronger rationale (or performance
>> data) to back it up. So far I can agree that the defect rate in this
>> area is somewhat abnormal, but that seems to be because we don't
>> understand how to use the list_lru API to its best advantage.
>>
> 
> I would characterise the cause of the defect rate differently.
> I would say it is because we are treating this as an lru-style problem
> when it isn't an lru-style problem.  list_lru is great for lrus.  That
> isn't what we have.
> 
> What we have is a desire to keep files open between consecutive IO
> requests without any clear indication of when we have seen the last in a
> series of IO requests.  So we make a policy decision "keep files open
> until there have been no IOs for 2 seconds - then close them".
> This is a good and sensible policy that nothing to do with the "LRU"
> concept. 
> 
> We implement this policy by keeping all unused files on a list, set a
> flag every time the file is used, clearing the flag on a timer tick
> (every 2 seconds) and closing anything which still has the flag cleared
> 2 seconds later.
> 
> Still nothing in this description that is at all related to LRU
> concepts.
> 
> Now we decide that it would be good the add a shinker - fair enough as
> we don't *need* these to remain.  How should the shrinker choose files
> to close?  It probably doesn't matter beyond avoiding files that still
> have the not-timed-out flag set.
> 
> But we try to also impose an LRU disciple over the list, and we use
> list_lru.
> The interfaces for list_lru() are well documented but the intent is
> not.  Most users of list_lru (gfs2/quota might be an exception) only
> explicitly delete things from the lru when it is time to discard them
> completely.  They rely on the shrinker to detect things that are in use
> again, and to remove them.  And possibly to detect things that have been
> referenced and to rotate them.  But if the shrinker doesn't run because
> there isn't much memory pressure they are just left alone.
> 
> This is what list_lru is optimised for - for shrinker driven scanning
> which skips or removes or rotates things that can't or shouldn't
> be freed, and frees others.  You would expect to normally only scan a
> small fraction of the list, because realistically you want to keep most
> of them.
> 
> For filecache we don't want to keep them very long.  So I think it
> matters a lot less what we choose for shrinking.  I'm tempted to suggest
> we don't bother with the shrinker.  Old files will be discarded soon
> anyway if they aren't used, and slowness in memory allocation (due to
> any memory pressure) will naturally slow down the addition of new files
> to the cache.  So the cache will shrink naturally.
> 
> I'm not 100% certain of that, but I do think that the needs of the
> shrinker should not dominate the design as they currently do.

I'm willing to consider removing the shrinker, but IMO we can't have it
both ways: if we remove the shrinker because items age out quickly, then
we can't then leave them in the cache indefinitely.

I would need to be convinced that the filecache cannot grow without
bound -- there can't be any pathological cases here, and competing
namespaces can't starve each other.


> Note that maybe we *don't* need to close files so quickly.  Maybe we
> could discard the whole timer thing, and then it would make sense to use
> list_lru().  What is the cost of keeping them open?

The reason to age these items quickly is because we don't want open
NFSv3 files to block conflicting NFSv4 opens. Again, I think that
contraindicates leaving items in the filecache (or, at least, leaving
them open) indefinitely.


> All I can think of is that it affects unlink.  An unlinked file won't be
> removed while there is a reference to the inode.  Maybe we should
> address that by taking a lease on the file while it is in the
> filecache??  When the lease is broken, we discard the file from the
> cache. 
> 
> If that could work (which might involve creating a new internal lease
> type that is only broken on unlink), then we could remove the timeout
> and leave files in the cache indefinitely.  Then it would make perfect
> sense to use list_lru() because the problem would start to look exactly
> like an LRU problem.  But I don't think that is what we have today.

IMO, filecache does utilize the LRU characteristic of list_lru: the
shrinker uses that characteristic to select files that may be closed
early. Early on, ISTR there was a cap on the number of items that could
be cached -- again, that kind of mechanism would make direct use of an
LRU to evict only older items when the cache is full. So there are
technical rationales for the current code structure, but they might have
become historical as the code evolved.

Dave Chinner Feb. 10, 2025, 11:57 p.m. UTC | #8

On Mon, Feb 10, 2025 at 09:01:41AM -0500, Jeff Layton wrote:
> On Fri, 2025-02-07 at 16:15 +1100, NeilBrown wrote:
> > The filecache lru is walked in 2 circumstances for 2 different reasons.
> > 
> > 1/ When called from the shrinker we want to discard the first few
> >    entries on the list, ignoring any with NFSD_FILE_REFERENCED set
> >    because they should really be at the end of the LRU as they have been
> >    referenced recently.  So those ones are ROTATED.
> > 
> > 2/ When called from the nfsd_file_gc() timer function we want to discard
> >    anything that hasn't been used since before the previous call, and
> >    mark everything else as unused at this point in time.
> > 
> > Using the same flag for both of these can result in some unexpected
> > outcomes.  If the shrinker callback clears NFSD_FILE_REFERENCED then the
> > nfsd_file_gc() will think the file hasn't been used in a while, while
> > really it has.
> > 
> > I think it is easier to reason about the behaviour if we instead have
> > two flags.
> > 
> >  NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
> >      put it there when convenient"
> >  NFSD_FILE_RECENT means "this has been used recently - since the last
> >      run of nfsd_file_gc()
> > 
> > When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> > should be moved to the end of the LRU and the flag cleared.  This can
> > safely happen at any time.  The actual order on the lru might not be
> > strictly least-recently-used, but that is normal for linux lrus.
> > 
> > The shrinker callback can ignore the "recent" flag.  If it ends up
> > freeing something that is "recent" that simply means that memory
> > pressure is sufficient to limit the acceptable cache age to less than
> > the nfsd_file_gc frequency.
> > 
> > The gc caller should primarily focus on NFSD_FILE_RECENT.  It should
> > free everything that doesn't have this flag set, and should clear the
> > flag on everything else.  When it clears the flag it is convenient to
> > clear the "REFERENCED" flag and move to the end of the LRU too.
> > 
> > With this, calls from the shrinker do not prematurely age files.  It
> > will focus only on freeing those that are least recently used.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/nfsd/filecache.c | 21 +++++++++++++++++++--
> >  fs/nfsd/filecache.h |  1 +
> >  fs/nfsd/trace.h     |  3 +++
> >  3 files changed, 23 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index 04588c03bdfe..9faf469354a5 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> >  		mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> >  }
> >  
> > -
> >  static bool nfsd_file_lru_add(struct nfsd_file *nf)
> >  {
> >  	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > +	set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
> 
> Technically, I don't think you need the REFERENCED bit at all. This is
> the only place it's set, and below this is calling list_lru_add_obj().
> That returns false if the object was already on a per-node LRU.
> 
> Instead of that, you could add a list_lru helper that will rotate the
> object to the end of its nodelist if it's already on one. OTOH, that
> might mean more cross NUMA-node accesses to the spinlocks than we get
> by using a flag and doing this at GC time.

No, please don't.

Per-object reference bits are required to enable lazy LRU rotation.
The LRU lists are -hot- objects; touching them every time we touch
an object on the LRU is prohibitively expensive because of exclusive
lock/cacheline contention. Hence we defer operations like rotation
to a context where we already have the list locked and cached
exclusively for some other reason (i.e. memory reclaim).

This is the same reason we use lazy removal from LRUs - it avoids
LRU list manipulations every time a hot cached object is accessed
and/or dropped.

IOWs, removing the per-object NFSD_FILE_REFERENCED bit will undo one
of the necessary the optimisations that allow hot caches LRU
management to work efficiently with minimal overhead.

-Dave.

Jeff Layton Feb. 11, 2025, 11:38 a.m. UTC | #9

On Tue, 2025-02-11 at 10:57 +1100, Dave Chinner wrote:
> On Mon, Feb 10, 2025 at 09:01:41AM -0500, Jeff Layton wrote:
> > On Fri, 2025-02-07 at 16:15 +1100, NeilBrown wrote:
> > > The filecache lru is walked in 2 circumstances for 2 different reasons.
> > > 
> > > 1/ When called from the shrinker we want to discard the first few
> > >    entries on the list, ignoring any with NFSD_FILE_REFERENCED set
> > >    because they should really be at the end of the LRU as they have been
> > >    referenced recently.  So those ones are ROTATED.
> > > 
> > > 2/ When called from the nfsd_file_gc() timer function we want to discard
> > >    anything that hasn't been used since before the previous call, and
> > >    mark everything else as unused at this point in time.
> > > 
> > > Using the same flag for both of these can result in some unexpected
> > > outcomes.  If the shrinker callback clears NFSD_FILE_REFERENCED then the
> > > nfsd_file_gc() will think the file hasn't been used in a while, while
> > > really it has.
> > > 
> > > I think it is easier to reason about the behaviour if we instead have
> > > two flags.
> > > 
> > >  NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
> > >      put it there when convenient"
> > >  NFSD_FILE_RECENT means "this has been used recently - since the last
> > >      run of nfsd_file_gc()
> > > 
> > > When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> > > should be moved to the end of the LRU and the flag cleared.  This can
> > > safely happen at any time.  The actual order on the lru might not be
> > > strictly least-recently-used, but that is normal for linux lrus.
> > > 
> > > The shrinker callback can ignore the "recent" flag.  If it ends up
> > > freeing something that is "recent" that simply means that memory
> > > pressure is sufficient to limit the acceptable cache age to less than
> > > the nfsd_file_gc frequency.
> > > 
> > > The gc caller should primarily focus on NFSD_FILE_RECENT.  It should
> > > free everything that doesn't have this flag set, and should clear the
> > > flag on everything else.  When it clears the flag it is convenient to
> > > clear the "REFERENCED" flag and move to the end of the LRU too.
> > > 
> > > With this, calls from the shrinker do not prematurely age files.  It
> > > will focus only on freeing those that are least recently used.
> > > 
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > ---
> > >  fs/nfsd/filecache.c | 21 +++++++++++++++++++--
> > >  fs/nfsd/filecache.h |  1 +
> > >  fs/nfsd/trace.h     |  3 +++
> > >  3 files changed, 23 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > index 04588c03bdfe..9faf469354a5 100644
> > > --- a/fs/nfsd/filecache.c
> > > +++ b/fs/nfsd/filecache.c
> > > @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> > >  		mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> > >  }
> > >  
> > > -
> > >  static bool nfsd_file_lru_add(struct nfsd_file *nf)
> > >  {
> > >  	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > +	set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
> > 
> > Technically, I don't think you need the REFERENCED bit at all. This is
> > the only place it's set, and below this is calling list_lru_add_obj().
> > That returns false if the object was already on a per-node LRU.
> > 
> > Instead of that, you could add a list_lru helper that will rotate the
> > object to the end of its nodelist if it's already on one. OTOH, that
> > might mean more cross NUMA-node accesses to the spinlocks than we get
> > by using a flag and doing this at GC time.
> 
> No, please don't.
> 
> Per-object reference bits are required to enable lazy LRU rotation.
> The LRU lists are -hot- objects; touching them every time we touch
> an object on the LRU is prohibitively expensive because of exclusive
> lock/cacheline contention. Hence we defer operations like rotation
> to a context where we already have the list locked and cached
> exclusively for some other reason (i.e. memory reclaim).
> 
> This is the same reason we use lazy removal from LRUs - it avoids
> LRU list manipulations every time a hot cached object is accessed
> and/or dropped.
> 
> IOWs, removing the per-object NFSD_FILE_REFERENCED bit will undo one
> of the necessary the optimisations that allow hot caches LRU
> management to work efficiently with minimal overhead.
> 

Yep, that was the point of my "OTOH" comment. Keeping the REFERENCED
flag is better from a "let's minimize cacheline invalidations"
standpoint.

NeilBrown Feb. 12, 2025, 10:39 p.m. UTC | #10

On Tue, 11 Feb 2025, Jeff Layton wrote:
> On Mon, 2025-02-10 at 13:31 +1100, NeilBrown wrote:
> > On Mon, 10 Feb 2025, Chuck Lever wrote:
> > > On 2/9/25 6:23 PM, NeilBrown wrote:
> > > > On Sat, 08 Feb 2025, Chuck Lever wrote:
> > > > > On 2/7/25 12:15 AM, NeilBrown wrote:
> > > > > > The filecache lru is walked in 2 circumstances for 2 different reasons.
> > > > > > 
> > > > > > 1/ When called from the shrinker we want to discard the first few
> > > > > >    entries on the list, ignoring any with NFSD_FILE_REFERENCED set
> > > > > >    because they should really be at the end of the LRU as they have been
> > > > > >    referenced recently.  So those ones are ROTATED.
> > > > > > 
> > > > > > 2/ When called from the nfsd_file_gc() timer function we want to discard
> > > > > >    anything that hasn't been used since before the previous call, and
> > > > > >    mark everything else as unused at this point in time.
> > > > > > 
> > > > > > Using the same flag for both of these can result in some unexpected
> > > > > > outcomes.  If the shrinker callback clears NFSD_FILE_REFERENCED then the
> > > > > > nfsd_file_gc() will think the file hasn't been used in a while, while
> > > > > > really it has.
> > > > > > 
> > > > > > I think it is easier to reason about the behaviour if we instead have
> > > > > > two flags.
> > > > > > 
> > > > > >  NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
> > > > > >      put it there when convenient"
> > > > > >  NFSD_FILE_RECENT means "this has been used recently - since the last
> > > > > >      run of nfsd_file_gc()
> > > > > > 
> > > > > > When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> > > > > > should be moved to the end of the LRU and the flag cleared.  This can
> > > > > > safely happen at any time.  The actual order on the lru might not be
> > > > > > strictly least-recently-used, but that is normal for linux lrus.
> > > > > > 
> > > > > > The shrinker callback can ignore the "recent" flag.  If it ends up
> > > > > > freeing something that is "recent" that simply means that memory
> > > > > > pressure is sufficient to limit the acceptable cache age to less than
> > > > > > the nfsd_file_gc frequency.
> > > > > > 
> > > > > > The gc caller should primarily focus on NFSD_FILE_RECENT.  It should
> > > > > > free everything that doesn't have this flag set, and should clear the
> > > > > > flag on everything else.  When it clears the flag it is convenient to
> > > > > > clear the "REFERENCED" flag and move to the end of the LRU too.
> > > > > > 
> > > > > > With this, calls from the shrinker do not prematurely age files.  It
> > > > > > will focus only on freeing those that are least recently used.
> > > > > > 
> > > > > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > > > > ---
> > > > > >  fs/nfsd/filecache.c | 21 +++++++++++++++++++--
> > > > > >  fs/nfsd/filecache.h |  1 +
> > > > > >  fs/nfsd/trace.h     |  3 +++
> > > > > >  3 files changed, 23 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > > > > index 04588c03bdfe..9faf469354a5 100644
> > > > > > --- a/fs/nfsd/filecache.c
> > > > > > +++ b/fs/nfsd/filecache.c
> > > > > > @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> > > > > >  		mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> > > > > >  }
> > > > > >  
> > > > > > -
> > > > > >  static bool nfsd_file_lru_add(struct nfsd_file *nf)
> > > > > >  {
> > > > > >  	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > > > > +	set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
> > > > > >  	if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
> > > > > >  		trace_nfsd_file_lru_add(nf);
> > > > > >  		return true;
> > > > > > @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> > > > > >  	return LRU_REMOVED;
> > > > > >  }
> > > > > >  
> > > > > > +static enum lru_status
> > > > > > +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
> > > > > > +		 void *arg)
> > > > > > +{
> > > > > > +	struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> > > > > > +
> > > > > > +	if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
> > > > > > +		/* "REFERENCED" really means "should be at the end of the LRU.
> > > > > > +		 * As we are putting it there we can clear the flag
> > > > > > +		 */
> > > > > > +		clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > > > > +		trace_nfsd_file_gc_aged(nf);
> > > > > > +		return LRU_ROTATE;
> > > > > > +	}
> > > > > > +	return nfsd_file_lru_cb(item, lru, arg);
> > > > > > +}
> > > > > > +
> > > > > >  static void
> > > > > >  nfsd_file_gc(void)
> > > > > >  {
> > > > > > @@ -537,7 +554,7 @@ nfsd_file_gc(void)
> > > > > >  
> > > > > >  	for_each_node_state(nid, N_NORMAL_MEMORY) {
> > > > > >  		unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> > > > > > -		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> > > > > > +		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
> > > > > >  					  &dispose, &nr);
> > > > > >  	}
> > > > > >  	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> > > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> > > > > > index d5db6b34ba30..de5b8aa7fcb0 100644
> > > > > > --- a/fs/nfsd/filecache.h
> > > > > > +++ b/fs/nfsd/filecache.h
> > > > > > @@ -38,6 +38,7 @@ struct nfsd_file {
> > > > > >  #define NFSD_FILE_PENDING	(1)
> > > > > >  #define NFSD_FILE_REFERENCED	(2)
> > > > > >  #define NFSD_FILE_GC		(3)
> > > > > > +#define NFSD_FILE_RECENT	(4)
> > > > > >  	unsigned long		nf_flags;
> > > > > >  	refcount_t		nf_ref;
> > > > > >  	unsigned char		nf_may;
> > > > > > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > > > > > index ad2c0c432d08..9af723eeb2b0 100644
> > > > > > --- a/fs/nfsd/trace.h
> > > > > > +++ b/fs/nfsd/trace.h
> > > > > > @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
> > > > > >  		{ 1 << NFSD_FILE_HASHED,	"HASHED" },		\
> > > > > >  		{ 1 << NFSD_FILE_PENDING,	"PENDING" },		\
> > > > > >  		{ 1 << NFSD_FILE_REFERENCED,	"REFERENCED" },		\
> > > > > > +		{ 1 << NFSD_FILE_RECENT,	"RECENT" },		\
> > > > > >  		{ 1 << NFSD_FILE_GC,		"GC" })
> > > > > >  
> > > > > >  DECLARE_EVENT_CLASS(nfsd_file_class,
> > > > > > @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
> > > > > >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
> > > > > >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
> > > > > >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
> > > > > > +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
> > > > > >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
> > > > > >  
> > > > > >  DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
> > > > > > @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name,				\
> > > > > >  	TP_ARGS(removed, remaining))
> > > > > >  
> > > > > >  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
> > > > > > +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
> > > > > >  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
> > > > > >  
> > > > > >  TRACE_EVENT(nfsd_file_close,
> > > > > 
> > > > > The other patches in this series look like solid improvements. This one
> > > > > could be as well, but it will take me some time to understand it.
> > > > > 
> > > > > I am generally in favor of replacing the logic that removes and adds
> > > > > these items with a single atomic bitop, and I'm happy to see NFSD stick
> > > > > with the use of an existing LRU facility while documenting its unique
> > > > > requirements ("nfsd_file_gc_aged" and so on).
> > > > > 
> > > > > I would still prefer the backport to be lighter -- looks like the key
> > > > > changes are 3/6 and 6/6. Is there any chance the series can be
> > > > > reorganized to facilitate backporting? I have to ask, and the answer
> > > > > might be "no", I realize.
> > > > 
> > > > I'm going with "no".
> > > > To be honest, I was hoping that the complexity displayed here needed
> > > > to work around the assumptions of list_lru what don't match our needs
> > > > would be sufficient to convince you that list_lru isn't worth pursuing. 
> > > > I see that didn't work.
> > > 
> > > Fair enough.
> > > 
> > > 
> > > > So I am no longer invested in this patch set.  You are welcome to use it
> > > > if you wish and to make any changes that you think are suitable, but I
> > > > don't think it is a good direction to go and will not be offering any
> > > > more code changes to support the use of list_lru here.
> > > 
> > > If I may observe, you haven't offered a compelling explanation of why an
> > > imperfect fit between list_lru and the filecache adds more technical
> > > debt than does the introduction of a bespoke LRU mechanism.
> > > 
> > > I'm open to that argument, but I need stronger rationale (or performance
> > > data) to back it up. So far I can agree that the defect rate in this
> > > area is somewhat abnormal, but that seems to be because we don't
> > > understand how to use the list_lru API to its best advantage.
> > > 
> > 
> > I would characterise the cause of the defect rate differently.
> > I would say it is because we are treating this as an lru-style problem
> > when it isn't an lru-style problem.  list_lru is great for lrus.  That
> > isn't what we have.
> > 
> > What we have is a desire to keep files open between consecutive IO
> > requests without any clear indication of when we have seen the last in a
> > series of IO requests.  So we make a policy decision "keep files open
> > until there have been no IOs for 2 seconds - then close them".
> > This is a good and sensible policy that nothing to do with the "LRU"
> > concept. 
> > 
> > We implement this policy by keeping all unused files on a list, set a
> > flag every time the file is used, clearing the flag on a timer tick
> > (every 2 seconds) and closing anything which still has the flag cleared
> > 2 seconds later.
> > 
> > Still nothing in this description that is at all related to LRU
> > concepts.
> > 
> > Now we decide that it would be good the add a shinker - fair enough as
> > we don't *need* these to remain.  How should the shrinker choose files
> > to close?  It probably doesn't matter beyond avoiding files that still
> > have the not-timed-out flag set.
> > 
> > But we try to also impose an LRU disciple over the list, and we use
> > list_lru.
> > The interfaces for list_lru() are well documented but the intent is
> > not.  Most users of list_lru (gfs2/quota might be an exception) only
> > explicitly delete things from the lru when it is time to discard them
> > completely.  They rely on the shrinker to detect things that are in use
> > again, and to remove them.  And possibly to detect things that have been
> > referenced and to rotate them.  But if the shrinker doesn't run because
> > there isn't much memory pressure they are just left alone.
> > 
> > This is what list_lru is optimised for - for shrinker driven scanning
> > which skips or removes or rotates things that can't or shouldn't
> > be freed, and frees others.  You would expect to normally only scan a
> > small fraction of the list, because realistically you want to keep most
> > of them.
> > 
> > For filecache we don't want to keep them very long.  So I think it
> > matters a lot less what we choose for shrinking.  I'm tempted to suggest
> > we don't bother with the shrinker.  Old files will be discarded soon
> > anyway if they aren't used, and slowness in memory allocation (due to
> > any memory pressure) will naturally slow down the addition of new files
> > to the cache.  So the cache will shrink naturally.
> > 
> > I'm not 100% certain of that, but I do think that the needs of the
> > shrinker should not dominate the design as they currently do.
> > 
> > Note that maybe we *don't* need to close files so quickly.  Maybe we
> > could discard the whole timer thing, and then it would make sense to use
> > list_lru().  What is the cost of keeping them open?
> > 
> > All I can think of is that it affects unlink.  An unlinked file won't be
> > removed while there is a reference to the inode.  Maybe we should
> > address that by taking a lease on the file while it is in the
> > filecache??  When the lease is broken, we discard the file from the
> > cache. 
> 
> 
> It may also affect other applications trying to take out leases. The
> filecache has the  nfsd_file_lease_notifier that tells it when someone
> is trying to take out a lease on a file. That happens then it will try
> to close the file first.

It wouldn't have to be an FL_LEASE lease.  We could invent a new thing:
FL_CACHED which gets added to the locks list and triggers a notification
whenever anyone tries to break leases.

The nfsd_file_lease_notifier is interesting... I wasn't aware of that.


> 
> >
> > If that could work (which might involve creating a new internal lease
> > type that is only broken on unlink), then we could remove the timeout
> > and leave files in the cache indefinitely.  Then it would make perfect
> > sense to use list_lru() because the problem would start to look exactly
> > like an LRU problem.  But I don't think that is what we have today.
> > 
> 
> The filecache already sets a fsnotify_mark on the inode to watch for
> its i_nlink to go to 0, and then removes it from the cache when that
> happens. I think we could keep these files open for quite a bit longer
> if we chose to do so.

Chuck mentioned that holding he v3 files open longer could interfere
with v4 DENY_READ etc opens.  But I think they only every test for other
v4 opens.  A v3 (or local) open never blocks a v4 DENY open.  Does that
match your understanding?

Do you know of any other reason that we currently time out files after
2-4 seconds?

> 
> One thing that Chuck has brought up a few times is that maybe we should
> consider making v4 not use the filecache at all. If that simplifies
> things then that might be a good thing to consider as well.

As v4 stores the file with the open state it shouldn't need the cache.
Maybe the filecache makes life a bit easier for localio?

I don't know that it would make the garbage-collection/shrinking any
simpler but it does superficially seem like an unnecessary indirection. 
Do you know of any specific benefit it brings v4?

Thanks,
NeilBrown

Chuck Lever Feb. 13, 2025, 12:08 a.m. UTC | #11

On 2/12/25 5:39 PM, NeilBrown wrote:
> On Tue, 11 Feb 2025, Jeff Layton wrote:
>> One thing that Chuck has brought up a few times is that maybe we should
>> consider making v4 not use the filecache at all. If that simplifies
>> things then that might be a good thing to consider as well.
> 
> As v4 stores the file with the open state it shouldn't need the cache.
> Maybe the filecache makes life a bit easier for localio?
> 
> I don't know that it would make the garbage-collection/shrinking any
> simpler but it does superficially seem like an unnecessary indirection. 
> Do you know of any specific benefit it brings v4?

One benefit is that the fs/nfsd/vfs.c APIs can all deal with nfsd_file's
no matter which NFS version is in use.

Jeff Layton Feb. 13, 2025, 11:27 a.m. UTC | #12

On Thu, 2025-02-13 at 09:39 +1100, NeilBrown wrote:
> On Tue, 11 Feb 2025, Jeff Layton wrote:
> > On Mon, 2025-02-10 at 13:31 +1100, NeilBrown wrote:
> > > On Mon, 10 Feb 2025, Chuck Lever wrote:
> > > > On 2/9/25 6:23 PM, NeilBrown wrote:
> > > > > On Sat, 08 Feb 2025, Chuck Lever wrote:
> > > > > > On 2/7/25 12:15 AM, NeilBrown wrote:
> > > > > > > The filecache lru is walked in 2 circumstances for 2 different reasons.
> > > > > > > 
> > > > > > > 1/ When called from the shrinker we want to discard the first few
> > > > > > >    entries on the list, ignoring any with NFSD_FILE_REFERENCED set
> > > > > > >    because they should really be at the end of the LRU as they have been
> > > > > > >    referenced recently.  So those ones are ROTATED.
> > > > > > > 
> > > > > > > 2/ When called from the nfsd_file_gc() timer function we want to discard
> > > > > > >    anything that hasn't been used since before the previous call, and
> > > > > > >    mark everything else as unused at this point in time.
> > > > > > > 
> > > > > > > Using the same flag for both of these can result in some unexpected
> > > > > > > outcomes.  If the shrinker callback clears NFSD_FILE_REFERENCED then the
> > > > > > > nfsd_file_gc() will think the file hasn't been used in a while, while
> > > > > > > really it has.
> > > > > > > 
> > > > > > > I think it is easier to reason about the behaviour if we instead have
> > > > > > > two flags.
> > > > > > > 
> > > > > > >  NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
> > > > > > >      put it there when convenient"
> > > > > > >  NFSD_FILE_RECENT means "this has been used recently - since the last
> > > > > > >      run of nfsd_file_gc()
> > > > > > > 
> > > > > > > When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> > > > > > > should be moved to the end of the LRU and the flag cleared.  This can
> > > > > > > safely happen at any time.  The actual order on the lru might not be
> > > > > > > strictly least-recently-used, but that is normal for linux lrus.
> > > > > > > 
> > > > > > > The shrinker callback can ignore the "recent" flag.  If it ends up
> > > > > > > freeing something that is "recent" that simply means that memory
> > > > > > > pressure is sufficient to limit the acceptable cache age to less than
> > > > > > > the nfsd_file_gc frequency.
> > > > > > > 
> > > > > > > The gc caller should primarily focus on NFSD_FILE_RECENT.  It should
> > > > > > > free everything that doesn't have this flag set, and should clear the
> > > > > > > flag on everything else.  When it clears the flag it is convenient to
> > > > > > > clear the "REFERENCED" flag and move to the end of the LRU too.
> > > > > > > 
> > > > > > > With this, calls from the shrinker do not prematurely age files.  It
> > > > > > > will focus only on freeing those that are least recently used.
> > > > > > > 
> > > > > > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > > > > > ---
> > > > > > >  fs/nfsd/filecache.c | 21 +++++++++++++++++++--
> > > > > > >  fs/nfsd/filecache.h |  1 +
> > > > > > >  fs/nfsd/trace.h     |  3 +++
> > > > > > >  3 files changed, 23 insertions(+), 2 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > > > > > index 04588c03bdfe..9faf469354a5 100644
> > > > > > > --- a/fs/nfsd/filecache.c
> > > > > > > +++ b/fs/nfsd/filecache.c
> > > > > > > @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> > > > > > >  		mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> > > > > > >  }
> > > > > > >  
> > > > > > > -
> > > > > > >  static bool nfsd_file_lru_add(struct nfsd_file *nf)
> > > > > > >  {
> > > > > > >  	set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > > > > > +	set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
> > > > > > >  	if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
> > > > > > >  		trace_nfsd_file_lru_add(nf);
> > > > > > >  		return true;
> > > > > > > @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> > > > > > >  	return LRU_REMOVED;
> > > > > > >  }
> > > > > > >  
> > > > > > > +static enum lru_status
> > > > > > > +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
> > > > > > > +		 void *arg)
> > > > > > > +{
> > > > > > > +	struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> > > > > > > +
> > > > > > > +	if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
> > > > > > > +		/* "REFERENCED" really means "should be at the end of the LRU.
> > > > > > > +		 * As we are putting it there we can clear the flag
> > > > > > > +		 */
> > > > > > > +		clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > > > > > +		trace_nfsd_file_gc_aged(nf);
> > > > > > > +		return LRU_ROTATE;
> > > > > > > +	}
> > > > > > > +	return nfsd_file_lru_cb(item, lru, arg);
> > > > > > > +}
> > > > > > > +
> > > > > > >  static void
> > > > > > >  nfsd_file_gc(void)
> > > > > > >  {
> > > > > > > @@ -537,7 +554,7 @@ nfsd_file_gc(void)
> > > > > > >  
> > > > > > >  	for_each_node_state(nid, N_NORMAL_MEMORY) {
> > > > > > >  		unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> > > > > > > -		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> > > > > > > +		ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
> > > > > > >  					  &dispose, &nr);
> > > > > > >  	}
> > > > > > >  	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> > > > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> > > > > > > index d5db6b34ba30..de5b8aa7fcb0 100644
> > > > > > > --- a/fs/nfsd/filecache.h
> > > > > > > +++ b/fs/nfsd/filecache.h
> > > > > > > @@ -38,6 +38,7 @@ struct nfsd_file {
> > > > > > >  #define NFSD_FILE_PENDING	(1)
> > > > > > >  #define NFSD_FILE_REFERENCED	(2)
> > > > > > >  #define NFSD_FILE_GC		(3)
> > > > > > > +#define NFSD_FILE_RECENT	(4)
> > > > > > >  	unsigned long		nf_flags;
> > > > > > >  	refcount_t		nf_ref;
> > > > > > >  	unsigned char		nf_may;
> > > > > > > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > > > > > > index ad2c0c432d08..9af723eeb2b0 100644
> > > > > > > --- a/fs/nfsd/trace.h
> > > > > > > +++ b/fs/nfsd/trace.h
> > > > > > > @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
> > > > > > >  		{ 1 << NFSD_FILE_HASHED,	"HASHED" },		\
> > > > > > >  		{ 1 << NFSD_FILE_PENDING,	"PENDING" },		\
> > > > > > >  		{ 1 << NFSD_FILE_REFERENCED,	"REFERENCED" },		\
> > > > > > > +		{ 1 << NFSD_FILE_RECENT,	"RECENT" },		\
> > > > > > >  		{ 1 << NFSD_FILE_GC,		"GC" })
> > > > > > >  
> > > > > > >  DECLARE_EVENT_CLASS(nfsd_file_class,
> > > > > > > @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
> > > > > > >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
> > > > > > >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
> > > > > > >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
> > > > > > > +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
> > > > > > >  DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
> > > > > > >  
> > > > > > >  DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
> > > > > > > @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name,				\
> > > > > > >  	TP_ARGS(removed, remaining))
> > > > > > >  
> > > > > > >  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
> > > > > > > +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
> > > > > > >  DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
> > > > > > >  
> > > > > > >  TRACE_EVENT(nfsd_file_close,
> > > > > > 
> > > > > > The other patches in this series look like solid improvements. This one
> > > > > > could be as well, but it will take me some time to understand it.
> > > > > > 
> > > > > > I am generally in favor of replacing the logic that removes and adds
> > > > > > these items with a single atomic bitop, and I'm happy to see NFSD stick
> > > > > > with the use of an existing LRU facility while documenting its unique
> > > > > > requirements ("nfsd_file_gc_aged" and so on).
> > > > > > 
> > > > > > I would still prefer the backport to be lighter -- looks like the key
> > > > > > changes are 3/6 and 6/6. Is there any chance the series can be
> > > > > > reorganized to facilitate backporting? I have to ask, and the answer
> > > > > > might be "no", I realize.
> > > > > 
> > > > > I'm going with "no".
> > > > > To be honest, I was hoping that the complexity displayed here needed
> > > > > to work around the assumptions of list_lru what don't match our needs
> > > > > would be sufficient to convince you that list_lru isn't worth pursuing. 
> > > > > I see that didn't work.
> > > > 
> > > > Fair enough.
> > > > 
> > > > 
> > > > > So I am no longer invested in this patch set.  You are welcome to use it
> > > > > if you wish and to make any changes that you think are suitable, but I
> > > > > don't think it is a good direction to go and will not be offering any
> > > > > more code changes to support the use of list_lru here.
> > > > 
> > > > If I may observe, you haven't offered a compelling explanation of why an
> > > > imperfect fit between list_lru and the filecache adds more technical
> > > > debt than does the introduction of a bespoke LRU mechanism.
> > > > 
> > > > I'm open to that argument, but I need stronger rationale (or performance
> > > > data) to back it up. So far I can agree that the defect rate in this
> > > > area is somewhat abnormal, but that seems to be because we don't
> > > > understand how to use the list_lru API to its best advantage.
> > > > 
> > > 
> > > I would characterise the cause of the defect rate differently.
> > > I would say it is because we are treating this as an lru-style problem
> > > when it isn't an lru-style problem.  list_lru is great for lrus.  That
> > > isn't what we have.
> > > 
> > > What we have is a desire to keep files open between consecutive IO
> > > requests without any clear indication of when we have seen the last in a
> > > series of IO requests.  So we make a policy decision "keep files open
> > > until there have been no IOs for 2 seconds - then close them".
> > > This is a good and sensible policy that nothing to do with the "LRU"
> > > concept. 
> > > 
> > > We implement this policy by keeping all unused files on a list, set a
> > > flag every time the file is used, clearing the flag on a timer tick
> > > (every 2 seconds) and closing anything which still has the flag cleared
> > > 2 seconds later.
> > > 
> > > Still nothing in this description that is at all related to LRU
> > > concepts.
> > > 
> > > Now we decide that it would be good the add a shinker - fair enough as
> > > we don't *need* these to remain.  How should the shrinker choose files
> > > to close?  It probably doesn't matter beyond avoiding files that still
> > > have the not-timed-out flag set.
> > > 
> > > But we try to also impose an LRU disciple over the list, and we use
> > > list_lru.
> > > The interfaces for list_lru() are well documented but the intent is
> > > not.  Most users of list_lru (gfs2/quota might be an exception) only
> > > explicitly delete things from the lru when it is time to discard them
> > > completely.  They rely on the shrinker to detect things that are in use
> > > again, and to remove them.  And possibly to detect things that have been
> > > referenced and to rotate them.  But if the shrinker doesn't run because
> > > there isn't much memory pressure they are just left alone.
> > > 
> > > This is what list_lru is optimised for - for shrinker driven scanning
> > > which skips or removes or rotates things that can't or shouldn't
> > > be freed, and frees others.  You would expect to normally only scan a
> > > small fraction of the list, because realistically you want to keep most
> > > of them.
> > > 
> > > For filecache we don't want to keep them very long.  So I think it
> > > matters a lot less what we choose for shrinking.  I'm tempted to suggest
> > > we don't bother with the shrinker.  Old files will be discarded soon
> > > anyway if they aren't used, and slowness in memory allocation (due to
> > > any memory pressure) will naturally slow down the addition of new files
> > > to the cache.  So the cache will shrink naturally.
> > > 
> > > I'm not 100% certain of that, but I do think that the needs of the
> > > shrinker should not dominate the design as they currently do.
> > > 
> > > Note that maybe we *don't* need to close files so quickly.  Maybe we
> > > could discard the whole timer thing, and then it would make sense to use
> > > list_lru().  What is the cost of keeping them open?
> > > 
> > > All I can think of is that it affects unlink.  An unlinked file won't be
> > > removed while there is a reference to the inode.  Maybe we should
> > > address that by taking a lease on the file while it is in the
> > > filecache??  When the lease is broken, we discard the file from the
> > > cache. 
> > 
> > 
> > It may also affect other applications trying to take out leases. The
> > filecache has the  nfsd_file_lease_notifier that tells it when someone
> > is trying to take out a lease on a file. That happens then it will try
> > to close the file first.
> 
> It wouldn't have to be an FL_LEASE lease.  We could invent a new thing:
> FL_CACHED which gets added to the locks list and triggers a notification
> whenever anyone tries to break leases.
> 
> The nfsd_file_lease_notifier is interesting... I wasn't aware of that.
> 
> 
> > 
> > > 
> > > If that could work (which might involve creating a new internal lease
> > > type that is only broken on unlink), then we could remove the timeout
> > > and leave files in the cache indefinitely.  Then it would make perfect
> > > sense to use list_lru() because the problem would start to look exactly
> > > like an LRU problem.  But I don't think that is what we have today.
> > > 
> > 
> > The filecache already sets a fsnotify_mark on the inode to watch for
> > its i_nlink to go to 0, and then removes it from the cache when that
> > happens. I think we could keep these files open for quite a bit longer
> > if we chose to do so.
> 
> Chuck mentioned that holding he v3 files open longer could interfere
> with v4 DENY_READ etc opens.  But I think they only every test for other
> v4 opens.  A v3 (or local) open never blocks a v4 DENY open.  Does that
> match your understanding?
> 

Yes, that's correct. nfsd's share/deny locking doesn't extend past
other NFSv4 files, so that shouldn't be a problem.

> Do you know of any other reason that we currently time out files after
> 2-4 seconds?
> 

No. The basic idea was to keep files open for "a little while" so that
when we're doing v3 READ/WRITE operations we can avoid having to do the
entire open/close. ~2-4s was considered long enough for that "little
while". We could easily keep them open longer, particularly since we
have some mechanisms to close them when needed for competing access
(leases and unlink activity, mostly).

> > 
> > One thing that Chuck has brought up a few times is that maybe we should
> > consider making v4 not use the filecache at all. If that simplifies
> > things then that might be a good thing to consider as well.
> 
> As v4 stores the file with the open state it shouldn't need the cache.
> Maybe the filecache makes life a bit easier for localio?
> 
> I don't know that it would make the garbage-collection/shrinking any
> simpler but it does superficially seem like an unnecessary indirection. 
> Do you know of any specific benefit it brings v4?
> 

None that I can think of.

Originally, I had thought that we would keep both v2/3 and v4 files in
the cache and that they would be shared. It hasn't really worked out
that way, so we could drop v4 files from the filecache altogether.

The only real "problem" is that struct nfsd_file is referenced in a lot
of nfsv4 code, so we'd need to do a big conversion back to it using
regular struct file pointers instead.

[4/6] nfsd: filecache: introduce NFSD_FILE_RECENT

Commit Message

Comments

Patch