[0/7] nfsd: filecache: change garbage collection lists

Message ID: 20250127012257.1803314-1-neilb@suse.de

Message

NeilBrown Jan. 27, 2025, 1:20 a.m. UTC
[
davec added to cc in case I've said something incorrect about list_lru

Changes in this version:
  - no _bh locking
  - add name for a magic constant
  - remove unnecessary race-handling code
  - give a more meaningful name for a lock for /proc/lock_stat
  - minor cleanups suggested by Jeff

]

The nfsd filecache currently uses a list_lru to track files recently
used in NFSv3 requests; these need to be "garbage collected" once they
have become idle - unused for 2-4 seconds.

I do not believe list_lru is a good tool for this.  It does not allow
the timeout which filecache requires, so we have to add a timeout
mechanism which holds the list_lru lock while the whole list is scanned
looking for entries that haven't been recently accessed.  When the list
is largish (even a few hundred entries) this can noticeably block new
requests, which need the lock to remove a file before accessing it.

This patch removes the list_lru and instead uses 2 simple linked lists.
When a file is accessed it is removed from whichever list it is on,
then added to the tail of the first list.  Every 2 seconds the second
list is moved to the "freeme" list and the first list is moved to the
second list.  This avoids any need to walk a list to find old entries.

These lists are per-netns rather than global because the freeme list is
per-netns: the actual freeing is done by nfsd threads, which are
per-netns.
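
To make that concrete, here is a rough sketch of the idea (field and
function names here are purely illustrative, not the actual patches):

struct nfsd_gc_lists {
        spinlock_t              lock;
        struct list_head        recent; /* accessed since the last 2s tick */
        struct list_head        older;  /* not accessed for one full tick */
        struct list_head        freeme; /* handed to nfsd threads to close */
};

/* On every access: (re)queue the file at the tail of the "recent" list. */
static void gc_touch(struct nfsd_gc_lists *gc, struct list_head *item)
{
        spin_lock(&gc->lock);
        list_move_tail(item, &gc->recent);
        spin_unlock(&gc->lock);
}

/*
 * Every 2 seconds: anything still on "older" has been idle for 2-4
 * seconds and can be freed; everything on "recent" starts aging.
 */
static void gc_tick(struct nfsd_gc_lists *gc)
{
        spin_lock(&gc->lock);
        list_splice_tail_init(&gc->older, &gc->freeme);
        list_splice_init(&gc->recent, &gc->older);
        spin_unlock(&gc->lock);
}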

Thanks,
NeilBrown

 [PATCH 1/7] nfsd: filecache: remove race handling.
 [PATCH 2/7] nfsd: filecache: use nfsd_file_dispose_list() in
 [PATCH 3/7] nfsd: filecache: move globals nfsd_file_lru and
 [PATCH 4/7] nfsd: filecache: change garbage collection list
 [PATCH 5/7] nfsd: filecache: document the arbitrary limit on
 [PATCH 6/7] nfsd: filecache: change garbage collection to a timer.
 [PATCH 7/7] nfsd: filecache: give disposal lock a unique class name.

Comments

Dave Chinner Jan. 28, 2025, 6:37 a.m. UTC | #1
On Mon, Jan 27, 2025 at 12:20:31PM +1100, NeilBrown wrote:
> [
> davec added to cc incase I've said something incorrect about list_lru
> 
> Changes in this version:
>   - no _bh locking
>   - add name for a magic constant
>   - remove unnecessary race-handling code
>   - give a more meaningfule name for a lock for /proc/lock_stat
>   - minor cleanups suggested by Jeff
> 
> ]
> 
> The nfsd filecache currently uses  list_lru for tracking files recently
> used in NFSv3 requests which need to be "garbage collected" when they
> have becoming idle - unused for 2-4 seconds.
> 
> I do not believe list_lru is a good tool for this.  It does not allow
> the timeout which filecache requires so we have to add a timeout
> mechanism which holds the list_lru lock while the whole list is scanned
> looking for entries that haven't been recently accessed.  When the list
> is largish (even a few hundred) this can block new requests noticably
> which need the lock to remove a file to access it.

Looks entirely like a trivial implementation bug in how the list_lru
is walked in nfsd_file_gc().

static void
nfsd_file_gc(void)
{
        LIST_HEAD(dispose);
        unsigned long ret;

        ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
                            &dispose, list_lru_count(&nfsd_file_lru));
			              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

        trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
        nfsd_file_dispose_list_delayed(&dispose);
}

i.e. the list_lru_walk() has been told to walk the entire list in a
single lock hold if nothing blocks it.

We've known this for a long, long time, and it's something we've
handled for a long time with shrinkers, too. Here's the typical way
of doing a full list aging and GC pass in one go without excessively
long lock holds:

{
	long nr_to_scan = list_lru_count(&nfsd_file_lru);
	LIST_HEAD(dispose);

	while (nr_to_scan > 0) {
		long batch = min(nr_to_scan, 64);

		list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
				&dispose, batch);

		if (list_empty(&dispose))
			break;
		dispose_list(&dispose);
		nr_to_scan -= batch;
	}
}

And we don't need two lists to separate recently referenced vs
gc candidates because we have a referenced bit in the nf->nf_flags.
i.e.  nfsd_file_lru_cb() does:

nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
                 void *arg)
{
....
        /* If it was recently added to the list, skip it */
        if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) {
                trace_nfsd_file_gc_referenced(nf);
                return LRU_ROTATE;
        }
.....

Which moves recently referenced entries to the far end of the list,
resulting in all the reclaimable objects congregating at the end of
the list that is walked first by list_lru_walk().

IOWs, a batched walk like above resumes the walk exactly where it
left off, because it is always either reclaiming or rotating the
object at the head of the list.

> This patch removes the list_lru and instead uses 2 simple linked lists.
> When a file is accessed it is removed from whichever list it is on,
> then added to the tail of the first list.  Every 2 seconds the second
> list is moved to the "freeme" list and the first list is moved to the
> second list.  This avoids any need to walk a list to find old entries.

Yup, that's exactly what the current code does via the laundrette
work that schedules nfsd_file_gc() to run every two seconds.

> These lists are per-netns rather than global as the freeme list is
> per-netns as the actual freeing is done in nfsd threads which are
> per-netns.

The list_lru is actually multiple lists - it is a per-numa node list
and so moving to global scope linked lists per netns is going to
reduce scalability and increase lock contention on large machines.

I also don't see any perf numbers, scalability analysis, latency
measurement, CPU profiles, etc showing the problems with using list_lru
for the GC function, nor any improvement this new code brings.

i.e. It's kinda hard to make any real comment on "I do not believe
list_lru is a good tool for this" when there is no actual
measurements provided to back the statement one way or the other...

-Dave.
Chuck Lever Jan. 28, 2025, 2:27 p.m. UTC | #2
On 1/28/25 1:37 AM, Dave Chinner wrote:
> On Mon, Jan 27, 2025 at 12:20:31PM +1100, NeilBrown wrote:
>> [
>> davec added to cc incase I've said something incorrect about list_lru
>>
>> Changes in this version:
>>    - no _bh locking
>>    - add name for a magic constant
>>    - remove unnecessary race-handling code
>>    - give a more meaningfule name for a lock for /proc/lock_stat
>>    - minor cleanups suggested by Jeff
>>
>> ]
>>
>> The nfsd filecache currently uses  list_lru for tracking files recently
>> used in NFSv3 requests which need to be "garbage collected" when they
>> have becoming idle - unused for 2-4 seconds.
>>
>> I do not believe list_lru is a good tool for this.  It does not allow
>> the timeout which filecache requires so we have to add a timeout
>> mechanism which holds the list_lru lock while the whole list is scanned
>> looking for entries that haven't been recently accessed.  When the list
>> is largish (even a few hundred) this can block new requests noticably
>> which need the lock to remove a file to access it.
> 
> Looks entirely like a trivial implementation bug in how the list_lru
> is walked in nfsd_file_gc().
> 
> static void
> nfsd_file_gc(void)
> {
>          LIST_HEAD(dispose);
>          unsigned long ret;
> 
>          ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
>                              &dispose, list_lru_count(&nfsd_file_lru));
> 			              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
>          trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
>          nfsd_file_dispose_list_delayed(&dispose);
> }
> 
> i.e. the list_lru_walk() has been told to walk the entire list in a
> single lock hold if nothing blocks it.
> 
> We've known this for a long, long time, and it's something we've
> handled for a long time with shrinkers, too. here's the typical way
> of doing a full list aging and GC pass in one go without excessively
> long lock holds:
> 
> {
> 	long nr_to_scan = list_lru_count(&nfsd_file_lru);
> 	LIST_HEAD(dispose);
> 
> 	while (nr_to_scan > 0) {
> 		long batch = min(nr_to_scan, 64);
> 
> 		list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> 				&dispose, batch);
> 
> 		if (list_empty(&dispose))
> 			break;
> 		dispose_list(&dispose);
> 		nr_to_scan -= batch;
> 	}
> }

The above is in fact similar to what we're planning to push first so
that it can be cleanly backported to LTS kernels:

https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-testing&id=9caea737d2cdfe2d194e225c1924090c1d68c25f


> And we don't need two lists to separate recently referenced vs
> gc candidates because we have a referenced bit in the nf->nf_flags.
> i.e.  nfsd_file_lru_cb() does:
> 
> nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
>                   void *arg)
> {
> ....
>          /* If it was recently added to the list, skip it */
>          if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) {
>                  trace_nfsd_file_gc_referenced(nf);
>                  return LRU_ROTATE;
>          }
> .....
> 
> Which moves recently referenced entries to the far end of the list,
> resulting in all the reclaimable objects congrating at the end of
> the list that is walked first by list_lru_walk().

My concern (which I haven't voiced yet) about having two lists is that
it will increase memory traffic over the current single atomic bit
operation.


> IOWs, a batched walk like above resumes the walk exactly where it
> left off, because it is always either reclaiming or rotating the
> object at the head of the list.
> 
>> This patch removes the list_lru and instead uses 2 simple linked lists.
>> When a file is accessed it is removed from whichever list it is on,
>> then added to the tail of the first list.  Every 2 seconds the second
>> list is moved to the "freeme" list and the first list is moved to the
>> second list.  This avoids any need to walk a list to find old entries.
> 
> Yup, that's exactly what the current code does via the laundrette
> work that schedules nfsd_file_gc() to run every two seconds does.
> 
>> These lists are per-netns rather than global as the freeme list is
>> per-netns as the actual freeing is done in nfsd threads which are
>> per-netns.
> 
> The list_lru is actually multiple lists - it is a per-numa node list
> and so moving to global scope linked lists per netns is going to
> reduce scalability and increase lock contention on large machines.
> 
> I also don't see any perf numbers, scalability analysis, latency
> measurement, CPU profiles, etc showing the problems with using list_lru
> for the GC function, nor any improvement this new code brings.
> 
> i.e. It's kinda hard to make any real comment on "I do not believe
> list_lru is a good tool for this" when there is no actual
> measurements provided to back the statement one way or the other...

True, it would be good to get some comparative metrics; in particular
looking at spin lock contention and memory traffic.
Chuck Lever Jan. 28, 2025, 4:05 p.m. UTC | #3
On 1/28/25 9:27 AM, Chuck Lever wrote:
> On 1/28/25 1:37 AM, Dave Chinner wrote:
>> On Mon, Jan 27, 2025 at 12:20:31PM +1100, NeilBrown wrote:
>>> [
>>> davec added to cc incase I've said something incorrect about list_lru
>>>
>>> Changes in this version:
>>>    - no _bh locking
>>>    - add name for a magic constant
>>>    - remove unnecessary race-handling code
>>>    - give a more meaningfule name for a lock for /proc/lock_stat
>>>    - minor cleanups suggested by Jeff
>>>
>>> ]
>>>
>>> The nfsd filecache currently uses  list_lru for tracking files recently
>>> used in NFSv3 requests which need to be "garbage collected" when they
>>> have becoming idle - unused for 2-4 seconds.
>>>
>>> I do not believe list_lru is a good tool for this.  It does not allow
>>> the timeout which filecache requires so we have to add a timeout
>>> mechanism which holds the list_lru lock while the whole list is scanned
>>> looking for entries that haven't been recently accessed.  When the list
>>> is largish (even a few hundred) this can block new requests noticably
>>> which need the lock to remove a file to access it.
>>
>> Looks entirely like a trivial implementation bug in how the list_lru
>> is walked in nfsd_file_gc().
>>
>> static void
>> nfsd_file_gc(void)
>> {
>>          LIST_HEAD(dispose);
>>          unsigned long ret;
>>
>>          ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
>>                              &dispose, list_lru_count(&nfsd_file_lru));
>>                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>>          trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
>>          nfsd_file_dispose_list_delayed(&dispose);
>> }
>>
>> i.e. the list_lru_walk() has been told to walk the entire list in a
>> single lock hold if nothing blocks it.
>>
>> We've known this for a long, long time, and it's something we've
>> handled for a long time with shrinkers, too. here's the typical way
>> of doing a full list aging and GC pass in one go without excessively
>> long lock holds:
>>
>> {
>>     long nr_to_scan = list_lru_count(&nfsd_file_lru);
>>     LIST_HEAD(dispose);
>>
>>     while (nr_to_scan > 0) {
>>         long batch = min(nr_to_scan, 64);
>>
>>         list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
>>                 &dispose, batch);
>>
>>         if (list_empty(&dispose))
>>             break;
>>         dispose_list(&dispose);
>>         nr_to_scan -= batch;
>>     }
>> }
> 
> The above is in fact similar to what we're planning to push first so
> that it can be cleanly backported to LTS kernels:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-testing&id=9caea737d2cdfe2d194e225c1924090c1d68c25f

I've rebased that branch. Here's a more permanent link:

https://lore.kernel.org/all/20250109142438.18689-2-cel@kernel.org/

But note that the batch size in the patch committed to my tree is 16
items, not 32.


>> And we don't need two lists to separate recently referenced vs
>> gc candidates because we have a referenced bit in the nf->nf_flags.
>> i.e.  nfsd_file_lru_cb() does:
>>
>> nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
>>                   void *arg)
>> {
>> ....
>>          /* If it was recently added to the list, skip it */
>>          if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) {
>>                  trace_nfsd_file_gc_referenced(nf);
>>                  return LRU_ROTATE;
>>          }
>> .....
>>
>> Which moves recently referenced entries to the far end of the list,
>> resulting in all the reclaimable objects congrating at the end of
>> the list that is walked first by list_lru_walk().
> 
> My concern (which I haven't voiced yet) about having two lists is that
> it will increase memory traffic over the current single atomic bit
> operation.
> 
> 
>> IOWs, a batched walk like above resumes the walk exactly where it
>> left off, because it is always either reclaiming or rotating the
>> object at the head of the list.
>>
>>> This patch removes the list_lru and instead uses 2 simple linked lists.
>>> When a file is accessed it is removed from whichever list it is on,
>>> then added to the tail of the first list.  Every 2 seconds the second
>>> list is moved to the "freeme" list and the first list is moved to the
>>> second list.  This avoids any need to walk a list to find old entries.
>>
>> Yup, that's exactly what the current code does via the laundrette
>> work that schedules nfsd_file_gc() to run every two seconds does.
>>
>>> These lists are per-netns rather than global as the freeme list is
>>> per-netns as the actual freeing is done in nfsd threads which are
>>> per-netns.
>>
>> The list_lru is actually multiple lists - it is a per-numa node list
>> and so moving to global scope linked lists per netns is going to
>> reduce scalability and increase lock contention on large machines.
>>
>> I also don't see any perf numbers, scalability analysis, latency
>> measurement, CPU profiles, etc showing the problems with using list_lru
>> for the GC function, nor any improvement this new code brings.
>>
>> i.e. It's kinda hard to make any real comment on "I do not believe
>> list_lru is a good tool for this" when there is no actual
>> measurements provided to back the statement one way or the other...
> 
> True, it would be good to get some comparative metrics; in particular
> looking at spin lock contention and memory traffic.
> 
>
NeilBrown Jan. 29, 2025, 9:34 p.m. UTC | #4
On Tue, 28 Jan 2025, Dave Chinner wrote:
> On Mon, Jan 27, 2025 at 12:20:31PM +1100, NeilBrown wrote:
> > [
> > davec added to cc incase I've said something incorrect about list_lru
> > 
> > Changes in this version:
> >   - no _bh locking
> >   - add name for a magic constant
> >   - remove unnecessary race-handling code
> >   - give a more meaningfule name for a lock for /proc/lock_stat
> >   - minor cleanups suggested by Jeff
> > 
> > ]
> > 
> > The nfsd filecache currently uses  list_lru for tracking files recently
> > used in NFSv3 requests which need to be "garbage collected" when they
> > have becoming idle - unused for 2-4 seconds.
> > 
> > I do not believe list_lru is a good tool for this.  It does not allow
> > the timeout which filecache requires so we have to add a timeout
> > mechanism which holds the list_lru lock while the whole list is scanned
> > looking for entries that haven't been recently accessed.  When the list
> > is largish (even a few hundred) this can block new requests noticably
> > which need the lock to remove a file to access it.
> 
> Looks entirely like a trivial implementation bug in how the list_lru
> is walked in nfsd_file_gc().
> 
> static void
> nfsd_file_gc(void)
> {
>         LIST_HEAD(dispose);
>         unsigned long ret;
> 
>         ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
>                             &dispose, list_lru_count(&nfsd_file_lru));
> 			              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
>         trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
>         nfsd_file_dispose_list_delayed(&dispose);
> }
> 
> i.e. the list_lru_walk() has been told to walk the entire list in a
> single lock hold if nothing blocks it.
> 
> We've known this for a long, long time, and it's something we've
> handled for a long time with shrinkers, too. here's the typical way
> of doing a full list aging and GC pass in one go without excessively
> long lock holds:

"Typical"?? Can you please point me to an existing example?

> 
> {
> 	long nr_to_scan = list_lru_count(&nfsd_file_lru);
> 	LIST_HEAD(dispose);
> 
> 	while (nr_to_scan > 0) {
> 		long batch = min(nr_to_scan, 64);
> 
> 		list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> 				&dispose, batch);
> 
> 		if (list_empty(&dispose))
> 			break;
> 		dispose_list(&dispose);
> 		nr_to_scan -= batch;
> 	}
> }
> 
> And we don't need two lists to separate recently referenced vs
> gc candidates because we have a referenced bit in the nf->nf_flags.
> i.e.  nfsd_file_lru_cb() does:
> 
> nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
>                  void *arg)
> {
> ....
>         /* If it was recently added to the list, skip it */
>         if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) {
>                 trace_nfsd_file_gc_referenced(nf);
>                 return LRU_ROTATE;
>         }
> .....
> 
> Which moves recently referenced entries to the far end of the list,
> resulting in all the reclaimable objects congrating at the end of
> the list that is walked first by list_lru_walk().
> 
> IOWs, a batched walk like above resumes the walk exactly where it
> left off, because it is always either reclaiming or rotating the
> object at the head of the list.

"rotating the object" to the head of the per-node list, not the head of
the whole list_Lru (except in the trivial single-node case).
list_lru_walk() iterates over the multiple node lists in a fixed order.
Suppose you have 4 nodes, each with 32 items, all of them referenced, and
a batch size of 64.
The first batch will process the 32 items on the first list, clearing the
referenced bit and moving them to the end of that list, then continue
through those 32 again, freeing them all.  The second batch will do the
same for the second list.  The last two lists won't be touched.

list_lru_walk() is only ever used (correctly) for clearing out a
list_lru.  It should probably be replaced by a function with a more apt
name which is targeted at this: no spinlocks, no return value from the
call-back.

Yes, we could change the above code to use list_lru_walk_node and wrap a
for loop around the whole thing, but then I wonder if list_lru is really
giving us anything of value.
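
For what it's worth, such a per-node loop might look something like this
(just a sketch, reusing the existing nfsd_file_lru, nfsd_file_lru_cb and
nfsd_file_dispose_list_delayed; the wrapper name is made up):

static void
nfsd_file_gc_per_node(void)
{
        LIST_HEAD(dispose);
        int nid;

        for_each_node_state(nid, N_NORMAL_MEMORY) {
                long remaining = list_lru_count_node(&nfsd_file_lru, nid);

                while (remaining > 0) {
                        /* walk at most 16 entries per lock hold */
                        unsigned long batch = min_t(long, remaining, 16);

                        list_lru_walk_node(&nfsd_file_lru, nid,
                                           nfsd_file_lru_cb, &dispose,
                                           &batch);
                        if (list_empty(&dispose))
                                break;
                        nfsd_file_dispose_list_delayed(&dispose);
                        remaining -= 16;
                }
        }
}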

Walking a linked list just to set a bit in every entry is a lot more work
than a few list manipulations in 5 cache-lines.

> 
> > This patch removes the list_lru and instead uses 2 simple linked lists.
> > When a file is accessed it is removed from whichever list it is on,
> > then added to the tail of the first list.  Every 2 seconds the second
> > list is moved to the "freeme" list and the first list is moved to the
> > second list.  This avoids any need to walk a list to find old entries.
> 
> Yup, that's exactly what the current code does via the laundrette
> work that schedules nfsd_file_gc() to run every two seconds does.
> 
> > These lists are per-netns rather than global as the freeme list is
> > per-netns as the actual freeing is done in nfsd threads which are
> > per-netns.
> 
> The list_lru is actually multiple lists - it is a per-numa node list
> and so moving to global scope linked lists per netns is going to
> reduce scalability and increase lock contention on large machines.

Possibly we could duplicate the filecache_disposal structure across
svc_pools instead of net namespaces.  That would fix the scalability
issue.  Probably we should avoid removing and re-adding the file in
the lru for every access as that probably causes more spinlock
contention.  We would need to adjust the ageing mechanism but I suspect
it could be made to work.

But I really wanted to make it correct first, performant later.  And I
still think that means not using list_lru.

> 
> I also don't see any perf numbers, scalability analysis, latency
> measurement, CPU profiles, etc showing the problems with using list_lru
> for the GC function, nor any improvement this new code brings.
> 
> i.e. It's kinda hard to make any real comment on "I do not believe
> list_lru is a good tool for this" when there is no actual
> measurements provided to back the statement one way or the other...

It's not about measurements, it's about behavioural correctness.  Yes, I
should have spelled that out better.  Thanks for encouraging me to do
so.

NeilBrown