Message ID | 20250105234022.2361576-2-neilb@suse.de (mailing list archive)
---|---
State | Changes Requested
Series | nfsd: add scheduling point in nfsd_file_gc()
On 1/5/25 6:11 PM, NeilBrown wrote:
> Under a high NFSv3 load with lots of different files being accessed, the
> list_lru of garbage-collectable files can become quite long.
>
> Asking list_lru_scan() to scan the whole list can result in a long
> period during which a spinlock is held and no scheduling is possible.
> This is impolite.
>
> So only ask list_lru_scan() to scan 1024 entries at a time, and repeat
> if necessary - calling cond_resched() each time.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfsd/filecache.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index a1cdba42c4fa..e99a86798e86 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -543,11 +543,18 @@ nfsd_file_gc(void)
>  {
>  	LIST_HEAD(dispose);
>  	unsigned long ret;
> -
> -	ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> -			    &dispose, list_lru_count(&nfsd_file_lru));
> -	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> -	nfsd_file_dispose_list_delayed(&dispose);
> +	unsigned long cnt = list_lru_count(&nfsd_file_lru);
> +
> +	while (cnt > 0) {

Since @cnt is unsigned, it should be safe to use "while (cnt) {"

(I might use "total" and "remaining" here -- "cnt", when said aloud,
leaves me snickering).

> +		unsigned long num_to_scan = min(cnt, 1024UL);

I see long delays with fewer than 1024 items on the list. I might
drop this number by one or two orders of magnitude. And make it a
symbolic constant.

There's another naked integer (8) in nfsd_file_net_dispose() -- how does
that relate to this new cap? Should that also be a symbolic constant?

> +		ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> +				    &dispose, num_to_scan);
> +		trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> +		nfsd_file_dispose_list_delayed(&dispose);

I need to go back and review the function traces to see where the
delays add up -- to make sure rescheduling here, rather than at some
other point, is appropriate. It probably is, but my memory fails me
these days.

> +		cnt -= num_to_scan;
> +		if (cnt)
> +			cond_resched();

Another approach might be to poke the laundrette again and simply
exit here, but I don't feel strongly about that.

> +	}
>  }
>
>  static void
On Mon, 06 Jan 2025, Chuck Lever wrote:
> On 1/5/25 6:11 PM, NeilBrown wrote:
> [...]
> > +	unsigned long cnt = list_lru_count(&nfsd_file_lru);
> > +
> > +	while (cnt > 0) {
>
> Since @cnt is unsigned, it should be safe to use "while (cnt) {"

"while (cnt > 0) {" is equally safe and for me it emphasises that it is
a decreasing counter, not a Bool. But different people have different
tastes.

> (I might use "total" and "remaining" here -- "cnt", when said aloud,
> leaves me snickering).

"remaining" would be better than "cnt".

> > +		unsigned long num_to_scan = min(cnt, 1024UL);
>
> I see long delays with fewer than 1024 items on the list. I might
> drop this number by one or two orders of magnitude. And make it a
> symbolic constant.

In that case I seriously wonder if this is where the delays are coming
from.

nfsd_file_dispose_list_delayed() does take and drop a spinlock
repeatedly (though it may not always be the same lock) and call
svc_wake_up() repeatedly - although the head of the queue might already
be woken. We could optimise that to detect runs with the same nn and
only take the lock once, and only wake_up once.

> There's another naked integer (8) in nfsd_file_net_dispose() -- how does
> that relate to this new cap? Should that also be a symbolic constant?

I don't think they relate. The trade-off with "8" is:
  a bigger number might block an nfsd thread for longer,
  forcing serialising when the work can usefully be done in parallel;
  a smaller number might needlessly wake lots of threads
  to share out a tiny amount of work.

The 1024 is simply about "don't hold a spinlock for too long".

> > +		ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> > +				    &dispose, num_to_scan);
> > +		trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> > +		nfsd_file_dispose_list_delayed(&dispose);
>
> I need to go back and review the function traces to see where the
> delays add up -- to make sure rescheduling here, rather than at some
> other point, is appropriate. It probably is, but my memory fails me
> these days.

I would like to see those function traces too.

> > +		cnt -= num_to_scan;
> > +		if (cnt)
> > +			cond_resched();
>
> Another approach might be to poke the laundrette again and simply
> exit here, but I don't feel strongly about that.

That wouldn't work without storing the 'remaining' count somewhere.
With the current design we need to scan the whole list every 2 seconds.

Thanks,
NeilBrown

> > +	}
> >  }
> >
> >  static void
>
> --
> Chuck Lever
On 1/5/25 10:02 PM, NeilBrown wrote:
> On Mon, 06 Jan 2025, Chuck Lever wrote:
>> On 1/5/25 6:11 PM, NeilBrown wrote:
>>> +		unsigned long num_to_scan = min(cnt, 1024UL);
>>
>> I see long delays with fewer than 1024 items on the list. I might
>> drop this number by one or two orders of magnitude. And make it a
>> symbolic constant.
>
> In that case I seriously wonder if this is where the delays are coming
> from.
>
> nfsd_file_dispose_list_delayed() does take and drop a spinlock
> repeatedly (though it may not always be the same lock) and call
> svc_wake_up() repeatedly - although the head of the queue might already
> be woken. We could optimise that to detect runs with the same nn and
> only take the lock once, and only wake_up once.
>
>> There's another naked integer (8) in nfsd_file_net_dispose() -- how does
>> that relate to this new cap? Should that also be a symbolic constant?
>
> I don't think they relate.
> The trade-off with "8" is:
>   a bigger number might block an nfsd thread for longer,
>   forcing serialising when the work can usefully be done in parallel;
>   a smaller number might needlessly wake lots of threads
>   to share out a tiny amount of work.
>
> The 1024 is simply about "don't hold a spinlock for too long".

By that, I think you mean list_lru_walk() takes &l->lock for the
duration of the scan? For a long scan, that would effectively block
adding or removing LRU items for quite some time.

So here's a typical excerpt from a common test:

  kworker/u80:7-206  [003]  266.985735: nfsd_file_unhash: ...

  kworker/u80:7-206  [003]  266.987723: nfsd_file_gc_removed: 1309
  entries removed, 2972 remaining

  nfsd-1532  [015]  266.988626: nfsd_file_free: ...

Here, the nfsd_file_unhash record marks the beginning of the LRU
walk, and the nfsd_file_gc_removed record marks the end. The
timestamps indicate the walk took two milliseconds.

The nfsd_file_free record above marks the last disposal activity.
That takes almost a millisecond, but as far as I can tell, it
does not hold any locks for long.

This seems to me like a strong argument for cutting the scan size
down to no more than 32-64 items. Ideally spin locks are supposed
to be held only for simple operations (eg, list_add); this seems a
little outside that window (hence your remark that "a large
nr_to_walk is always a bad idea" -- I now see what you meant).

IMHO the patch description should single out that purpose:

  We want to significantly reduce the maximum amount of time that
  list_lru_walk() blocks foreground LRU activity such as
  list_lru_add_obj().

The cond_resched() in this case might be gravy.

>>> +		ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
>>> +				    &dispose, num_to_scan);
>>> +		trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
>>> +		nfsd_file_dispose_list_delayed(&dispose);
>>
>> I need to go back and review the function traces to see where the
>> delays add up -- to make sure rescheduling here, rather than at some
>> other point, is appropriate. It probably is, but my memory fails me
>> these days.
>
> I would like to see those function traces too.

Here's my reproducer:

1. On the client: Set up xfstests to use NFSv3
2. On the server: "sudo trace-cmd record -e nfsd &"
3. On the client: Run xfstests with "sudo ./check -nfs generic/750"
On Tue, 07 Jan 2025, Chuck Lever wrote:
> On 1/5/25 10:02 PM, NeilBrown wrote:
> [...]
>> The 1024 is simply about "don't hold a spinlock for too long".
>
> By that, I think you mean list_lru_walk() takes &l->lock for the
> duration of the scan? For a long scan, that would effectively block
> adding or removing LRU items for quite some time.
>
> So here's a typical excerpt from a common test:
>
>   kworker/u80:7-206  [003]  266.985735: nfsd_file_unhash: ...
>
>   kworker/u80:7-206  [003]  266.987723: nfsd_file_gc_removed: 1309
>   entries removed, 2972 remaining
>
>   nfsd-1532  [015]  266.988626: nfsd_file_free: ...
>
> Here, the nfsd_file_unhash record marks the beginning of the LRU
> walk, and the nfsd_file_gc_removed record marks the end. The
> timestamps indicate the walk took two milliseconds.
>
> The nfsd_file_free record above marks the last disposal activity.
> That takes almost a millisecond, but as far as I can tell, it
> does not hold any locks for long.
>
> This seems to me like a strong argument for cutting the scan size
> down to no more than 32-64 items. Ideally spin locks are supposed
> to be held only for simple operations (eg, list_add); this seems a
> little outside that window (hence your remark that "a large
> nr_to_walk is always a bad idea" -- I now see what you meant).

This is useful - thanks.

So the problem seems to be that holding the list_lru lock while scanning
the whole list can block all incoming NFSv3 for a noticeable amount of
time - 2 msecs above. That makes perfect sense, and as you say it
suggests that the lack of scheduling points isn't really the issue.

This confirms for me that the list_lru approach is not a good fit for
this problem. I have written a patch which replaces it with a pair of
simple lists as I described in my cover letter. It is a bit large and
needs careful review. In particular I haven't given thought to the
tracepoints which might need to be moved or changed or discarded.

Thanks,
NeilBrown
On 1/7/25 6:01 PM, NeilBrown wrote:
> On Tue, 07 Jan 2025, Chuck Lever wrote:
> [...]
>> This seems to me like a strong argument for cutting the scan size
>> down to no more than 32-64 items. Ideally spin locks are supposed
>> to be held only for simple operations (eg, list_add); this seems a
>> little outside that window (hence your remark that "a large
>> nr_to_walk is always a bad idea" -- I now see what you meant).
>
> This is useful - thanks.
> So the problem seems to be that holding the list_lru lock while scanning
> the whole list can block all incoming NFSv3 for a noticeable amount of
> time - 2 msecs above. That makes perfect sense and as you say it
> suggests that the lack of scheduling points isn't really the issue.
>
> This confirms for me that the list_lru approach is not a good fit for
> this problem. I have written a patch which replaces it with a pair of
> simple lists as I described in my cover letter.

Before proceeding with replacement of the LRU, is there interest in
addressing this issue in LTS kernels as well? If so, then IMO the
better approach would be to take a variant of your narrower fix for
v6.14, and then visit the deeper LRU changes for v6.15ff.

> It is a bit large and needs careful review. In particular I haven't
> given thought to the tracepoints which might need to be moved or
> changed or discarded.

Some of those trace points are specific to the state machine that is
part of the list_lru scan. Those should be removed if we replace that
state machine.
On Thu, 09 Jan 2025, Chuck Lever wrote:
> On 1/7/25 6:01 PM, NeilBrown wrote:
> [...]
>> This confirms for me that the list_lru approach is not a good fit for
>> this problem. I have written a patch which replaces it with a pair of
>> simple lists as I described in my cover letter.
>
> Before proceeding with replacement of the LRU, is there interest in
> addressing this issue in LTS kernels as well? If so, then IMO the
> better approach would be to take a variant of your narrower fix for
> v6.14, and then visit the deeper LRU changes for v6.15ff.

That is probably reasonable. You could take the first patch, drop the
1024 to 64 (or less if testing suggests that is still too high), and
maybe drop the cond_resched().

Thanks,
NeilBrown
On 1/8/25 3:41 PM, NeilBrown wrote:
> On Thu, 09 Jan 2025, Chuck Lever wrote:
> [...]
>> Before proceeding with replacement of the LRU, is there interest in
>> addressing this issue in LTS kernels as well? If so, then IMO the
>> better approach would be to take a variant of your narrower fix for
>> v6.14, and then visit the deeper LRU changes for v6.15ff.
>
> That is probably reasonable. You could take the first patch, drop the
> 1024 to 64 (or less if testing suggests that is still too high), and
> maybe drop the cond_resched().

I will make it so. Enjoy the rest of your leave!
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index a1cdba42c4fa..e99a86798e86 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -543,11 +543,18 @@ nfsd_file_gc(void)
 {
 	LIST_HEAD(dispose);
 	unsigned long ret;
-
-	ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
-			    &dispose, list_lru_count(&nfsd_file_lru));
-	trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
-	nfsd_file_dispose_list_delayed(&dispose);
+	unsigned long cnt = list_lru_count(&nfsd_file_lru);
+
+	while (cnt > 0) {
+		unsigned long num_to_scan = min(cnt, 1024UL);
+		ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
+				    &dispose, num_to_scan);
+		trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
+		nfsd_file_dispose_list_delayed(&dispose);
+		cnt -= num_to_scan;
+		if (cnt)
+			cond_resched();
+	}
 }
 
 static void
Under a high NFSv3 load with lots of different files being accessed, the
list_lru of garbage-collectable files can become quite long.

Asking list_lru_scan() to scan the whole list can result in a long
period during which a spinlock is held and no scheduling is possible.
This is impolite.

So only ask list_lru_scan() to scan 1024 entries at a time, and repeat
if necessary - calling cond_resched() each time.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfsd/filecache.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)