
[v1,2/2] NFSD: Change the filecache laundrette workqueue again

Message ID: 20250103010002.619062-2-cel@kernel.org (mailing list archive)
State: Changes Requested
Series: [v1,1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages"

Commit Message

Chuck Lever Jan. 3, 2025, 1 a.m. UTC
From: Chuck Lever <chuck.lever@oracle.com>

Youzhong Yang <youzhong@gmail.com> noticed the workqueue subsystem
complaining about how long the filecache laundrette was running.
This resulted in switching from using the system_wq for the
laundrette to system_unbound_wq (see commit 4b84551a35e3 ("nfsd: use
system_unbound_wq for nfsd_file_gc_worker()")).

However, I've seen the laundrette running for multiple milliseconds
on some workloads, delaying other work. For the purpose of
scheduling fairness, perhaps a better choice would be to process
filecache disposal queues on the system_long_wq instead.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/filecache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jeff Layton Jan. 3, 2025, 2:21 p.m. UTC | #1
On Thu, 2025-01-02 at 20:00 -0500, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Youzhong Yang <youzhong@gmail.com> noticed the workqueue subsystem
> complaining about how long the filecache laundrette was running.
> This resulted in switching from using the system_wq for the
> laundrette to system_unbound_wq (see commit 4b84551a35e3 ("nfsd: use
> system_unbound_wq for nfsd_file_gc_worker()").
> 
> However, I've seen the laundrette running for multiple milliseconds
> on some workloads, delaying other work. For the purpose of
> scheduling fairness, perhaps a better choice would be to process
> filecache disposal queues on the system_long_wq instead.
> 
> Cc: Tejun Heo <tj@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/filecache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index a1cdba42c4fa..91a535c2dede 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -112,7 +112,7 @@ static void
>  nfsd_file_schedule_laundrette(void)
>  {
>  	if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags))
> -		queue_delayed_work(system_unbound_wq, &nfsd_filecache_laundrette,
> +		queue_delayed_work(system_long_wq, &nfsd_filecache_laundrette,
>  				   NFSD_LAUNDRETTE_DELAY);
>  }
>  

TIL that there is system_long_wq! Seems like a reasonable thing to do
since this is generally low-priority work.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
NeilBrown Jan. 3, 2025, 10:50 p.m. UTC | #2
On Fri, 03 Jan 2025, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Youzhong Yang <youzhong@gmail.com> noticed the workqueue subsystem
> complaining about how long the filecache laundrette was running.
> This resulted in switching from using the system_wq for the
> laundrette to system_unbound_wq (see commit 4b84551a35e3 ("nfsd: use
> system_unbound_wq for nfsd_file_gc_worker()").
> 
> However, I've seen the laundrette running for multiple milliseconds
> on some workloads, delaying other work. For the purpose of
> scheduling fairness, perhaps a better choice would be to process
> filecache disposal queues on the system_long_wq instead.
> 
> Cc: Tejun Heo <tj@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/filecache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index a1cdba42c4fa..91a535c2dede 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -112,7 +112,7 @@ static void
>  nfsd_file_schedule_laundrette(void)
>  {
>  	if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags))
> -		queue_delayed_work(system_unbound_wq, &nfsd_filecache_laundrette,
> +		queue_delayed_work(system_long_wq, &nfsd_filecache_laundrette,
>  				   NFSD_LAUNDRETTE_DELAY);

This doesn't seem right to me.
"unbound_wq" is not bound to an CPU.  It just gets allocated that thread
which shouldn't interfere with any other work (except for normal
scheduling contention) until the queue limit is reach, which is unlikely
to happen often.

"long_wq" IS bound to a CPU and is concurrency managed so it could
interfere with other things.  The idea is that work items scheduled
there are not in a hurry and could take longer.  But I dont' think they
should take very long...

I think the problem is elsewhere.
list_lru_walk() can hold a spin lock for as long as it takes to walk
nr_to_walk objects.  This will tie up the CPU no matter which work queue
it is running on.

I think that instead of passing "list_lru_count()" we should pass some
constant like 1024.

cnt = list_lru_count();
while (cnt > 0) {
   num = min(cnt, 1024);
   list_lru_walk(...., num);
   cond_resched();
   cnt -= num;
}

Then run it from system_wq.
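
Spelled out against the filecache code, the batched walk might look
something like this (only a sketch of the pattern; the nfsd_file_*
names are assumed from fs/nfsd/filecache.c and nothing here is a
tested patch):

static void
nfsd_file_gc(void)
{
        unsigned long cnt = list_lru_count(&nfsd_file_lru);
        LIST_HEAD(dispose);

        while (cnt > 0) {
                /* Walk at most 1024 entries per pass so the internal
                 * list_lru spin lock is only held briefly. */
                unsigned long num = min_t(unsigned long, cnt, 1024);

                list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
                              &dispose, num);
                cond_resched();
                cnt -= num;
        }
        nfsd_file_dispose_list_delayed(&dispose);
}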

list_lru_walk() is most often used as list_lru_shrink_walk() from a
shrinker, and the pattern there is essentially the one above.  A count is
taken, possibly scaled down, then the shrinker is called in batches.

NeilBrown
Tejun Heo Jan. 3, 2025, 10:53 p.m. UTC | #3
On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
> On Fri, 03 Jan 2025, cel@kernel.org wrote:
...
> I think that instead of passing "list_lru_count()" we should pass some
> constant like 1024.
> 
> cnt = list_lru_count()
> while (cnt > 0) {
>    num = min(cnt, 1024);
>    list_lru_walk(...., num);
>    cond_sched()
>    cnt -= num;
> }
> 
> Then run it from system_wq.
> 
> list_lru_shrink is most often called as list_lru_shrink_walk() from a
> shrinker, and the pattern there is essentially that above.  A count is
> taken, possibly scaled down, then the shrinker is called in batches.

BTW, there's nothing wrong with taking some msecs or even tens of msecs
running on system_unbound_wq, so the current state may be fine too.

Thanks.
Chuck Lever Jan. 4, 2025, 3:37 p.m. UTC | #4
On 1/3/25 5:53 PM, Tejun Heo wrote:
> On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
>> On Fri, 03 Jan 2025, cel@kernel.org wrote:
> ...
>> I think that instead of passing "list_lru_count()" we should pass some
>> constant like 1024.
>>
>> cnt = list_lru_count()
>> while (cnt > 0) {
>>     num = min(cnt, 1024);
>>     list_lru_walk(...., num);
>>     cond_sched()
>>     cnt -= num;
>> }
>>
>> Then run it from system_wq.
>>
>> list_lru_shrink is most often called as list_lru_shrink_walk() from a
>> shrinker, and the pattern there is essentially that above.  A count is
>> taken, possibly scaled down, then the shrinker is called in batches.
> 
> BTW, there's nothing wrong with taking some msecs or even tens of msecs
> running on system_unbound_wq, so the current state may be fine too.

My thinking was that this work is low priority, so there should be
plenty of opportunity to set it aside for a few moments and handle
higher priority work. Maybe not worrisome on systems with a high core
count, but on small SMP (e.g., VM guests), I've found that tasks like this
can be rude neighbors.

We could do this by adding a cond_resched() call site in the loop,
or by taking Neil's suggestion of breaking up the free list across
multiple work items that each handle one or just a few file releases.
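
A rough sketch of the latter, assuming the existing laundrette work
item is reused and picking an arbitrary batch size (an illustration
only, not a proposed patch):

/* Arbitrary small batch, for illustration only */
#define NFSD_FILE_GC_BATCH      16

static void
nfsd_file_gc_worker(struct work_struct *work)
{
        LIST_HEAD(dispose);
        unsigned long freed;

        /* Isolate and release only a small batch per work item. */
        freed = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
                              &dispose, NFSD_FILE_GC_BATCH);
        nfsd_file_dispose_list(&dispose);

        /* A full batch suggests more entries remain: requeue right
         * away so other queued work gets a turn in between. */
        if (freed == NFSD_FILE_GC_BATCH)
                queue_delayed_work(system_unbound_wq,
                                   &nfsd_filecache_laundrette, 0);
        else if (list_lru_count(&nfsd_file_lru))
                nfsd_file_schedule_laundrette();
}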
NeilBrown Jan. 5, 2025, 10:28 p.m. UTC | #5
On Sun, 05 Jan 2025, Chuck Lever wrote:
> On 1/3/25 5:53 PM, Tejun Heo wrote:
> > On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
> >> On Fri, 03 Jan 2025, cel@kernel.org wrote:
> > ...
> >> I think that instead of passing "list_lru_count()" we should pass some
> >> constant like 1024.
> >>
> >> cnt = list_lru_count()
> >> while (cnt > 0) {
> >>     num = min(cnt, 1024);
> >>     list_lru_walk(...., num);
> >>     cond_sched()
> >>     cnt -= num;
> >> }
> >>
> >> Then run it from system_wq.
> >>
> >> list_lru_shrink is most often called as list_lru_shrink_walk() from a
> >> shrinker, and the pattern there is essentially that above.  A count is
> >> taken, possibly scaled down, then the shrinker is called in batches.
> > 
> > BTW, there's nothing wrong with taking some msecs or even tens of msecs
> > running on system_unbound_wq, so the current state may be fine too.
> 
> My thinking was that this work is low priority, so there should be
> plenty of opportunity to set it aside for a few moments and handle
> higher priority work. Maybe not worrisome on systems with a high core
> count, but on small SMP (eg VM guests), I've found that tasks like this
> can be rude neighbors.

None of the different work queues are expected to "set aside" the work
for more than a normal scheduling time slice.
system_long_wq was created "so that flush_scheduled_work() doesn't take
so long", but flush_scheduled_work() is never called now (and would
generate a warning if it was), so system_long_wq should really be
removed.

system_unbound_wq uses threads that are not bound to a CPU so the
scheduler can move them around.  That is most suitable for something
that might run for a long time.

system_wq is bound to the CPU that schedules the work, but it can run
multiple work items - potentially long running ones - without trouble by
helping the scheduler share the CPU among them.  This requires that they
can sleep of course.

I haven't seen the actual symptoms that resulted in the various
changes to this code, but that last point is, I think, the key one.  The
work item, like all kernel code, needs to have scheduling points if it
might run for a longish time.  list_lru_walk() doesn't contain any
scheduling points, so giving it a large "nr_to_walk" is always a bad idea.

> 
> We could do this by adding a cond_resched() call site in the loop,
> or take Neil's suggestion of breaking up the free list across multiple
> work items that handle one or just a few file releases each.

I guess I should send a proper patch.

NeilBrown

> 
> -- 
> Chuck Lever
>
Chuck Lever Jan. 6, 2025, 1:39 a.m. UTC | #6
On 1/5/25 5:28 PM, NeilBrown wrote:
> On Sun, 05 Jan 2025, Chuck Lever wrote:
>> On 1/3/25 5:53 PM, Tejun Heo wrote:
>>> On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
>>>> On Fri, 03 Jan 2025, cel@kernel.org wrote:
>>> ...
>>>> I think that instead of passing "list_lru_count()" we should pass some
>>>> constant like 1024.
>>>>
>>>> cnt = list_lru_count()
>>>> while (cnt > 0) {
>>>>      num = min(cnt, 1024);
>>>>      list_lru_walk(...., num);
>>>>      cond_sched()
>>>>      cnt -= num;
>>>> }
>>>>
>>>> Then run it from system_wq.
>>>>
>>>> list_lru_shrink is most often called as list_lru_shrink_walk() from a
>>>> shrinker, and the pattern there is essentially that above.  A count is
>>>> taken, possibly scaled down, then the shrinker is called in batches.
>>>
>>> BTW, there's nothing wrong with taking some msecs or even tens of msecs
>>> running on system_unbound_wq, so the current state may be fine too.
>>
>> My thinking was that this work is low priority, so there should be
>> plenty of opportunity to set it aside for a few moments and handle
>> higher priority work. Maybe not worrisome on systems with a high core
>> count, but on small SMP (eg VM guests), I've found that tasks like this
>> can be rude neighbors.
> 
> None of the different work queues are expected to "set aside" the work
> for more than a normal scheduling time slice.
> system_long_wq was created "so that flush_scheduled_work() doesn't take
> so long" but flush_scheduled_work() is never called now (and would
> generate a warning if it was) so system_long_wq should really be
> removed.
> 
> system_unbound_wq uses threads that are not bound to a CPU so the
> scheduler can move them around.  That is most suitable for something
> that might run for a long time.

That's not my understanding: unbound selects a CPU to run the work
item on, and it can be (and usually is) a different CPU from the one
that invoked queue_work(). Then it isn't rescheduled. The work items
are expected to complete quickly; work item termination is the typical
reschedule point. My understanding, as always, could be stale.

But that's neither here nor there: I've dropped the patch to replace
system_unbound_wq with system_long_wq.


> system_wq is bound to the CPU that schedules the work, but it can run
> multiple work items - potentially long running ones - without trouble by
> helping the scheduler share the CPU among them.  This requires that they
> can sleep of course.
> 
> I haven't seen the actually symptoms that resulted in the various
> changes to this code, but that last point is I think the key one.  The
> work item, like all kernel code, needs to have scheduler points if it
> might run for a longish time.  list_lru_walk() doesn't contain any
> scheduling points so giving a large "nr_to_walk" is always a bad idea.

That might be overstated... I don't see other consumers that are so
concerned about rescheduling. But then it's not clear if they are
dealing with lengthy LRU lists.


>> We could do this by adding a cond_resched() call site in the loop,
>> or take Neil's suggestion of breaking up the free list across multiple
>> work items that handle one or just a few file releases each.
> 
> I guess I should send a proper patch.

More comments in reply to that patch. Generally speaking, I'm
comfortable chopping up the work as you propose; we just have to
address some details.
NeilBrown Jan. 6, 2025, 2:47 a.m. UTC | #7
On Mon, 06 Jan 2025, Chuck Lever wrote:
> On 1/5/25 5:28 PM, NeilBrown wrote:
> > On Sun, 05 Jan 2025, Chuck Lever wrote:
> >> On 1/3/25 5:53 PM, Tejun Heo wrote:
> >>> On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
> >>>> On Fri, 03 Jan 2025, cel@kernel.org wrote:
> >>> ...
> >>>> I think that instead of passing "list_lru_count()" we should pass some
> >>>> constant like 1024.
> >>>>
> >>>> cnt = list_lru_count()
> >>>> while (cnt > 0) {
> >>>>      num = min(cnt, 1024);
> >>>>      list_lru_walk(...., num);
> >>>>      cond_sched()
> >>>>      cnt -= num;
> >>>> }
> >>>>
> >>>> Then run it from system_wq.
> >>>>
> >>>> list_lru_shrink is most often called as list_lru_shrink_walk() from a
> >>>> shrinker, and the pattern there is essentially that above.  A count is
> >>>> taken, possibly scaled down, then the shrinker is called in batches.
> >>>
> >>> BTW, there's nothing wrong with taking some msecs or even tens of msecs
> >>> running on system_unbound_wq, so the current state may be fine too.
> >>
> >> My thinking was that this work is low priority, so there should be
> >> plenty of opportunity to set it aside for a few moments and handle
> >> higher priority work. Maybe not worrisome on systems with a high core
> >> count, but on small SMP (eg VM guests), I've found that tasks like this
> >> can be rude neighbors.
> > 
> > None of the different work queues are expected to "set aside" the work
> > for more than a normal scheduling time slice.
> > system_long_wq was created "so that flush_scheduled_work() doesn't take
> > so long" but flush_scheduled_work() is never called now (and would
> > generate a warning if it was) so system_long_wq should really be
> > removed.
> > 
> > system_unbound_wq uses threads that are not bound to a CPU so the
> > scheduler can move them around.  That is most suitable for something
> > that might run for a long time.
> 
> That's not my understanding: unbound selects a CPU to run the work
> item on, and it can be (and usually is) not the same CPU as the one
> that invoked queue_work(). Then it isn't rescheduled. The work items
> are expected to complete quickly; work item termination is the typical
> reschedule point. My understanding, as always, could be stale.

If the work item will never schedule, then there isn't much point in
using a work queue.  The code can be run inline, or with a timer.

> 
> But that's neither here nor there: I've dropped the patch to replace
> system_unbound_wq with system_long_wq.
> 
> 
> > system_wq is bound to the CPU that schedules the work, but it can run
> > multiple work items - potentially long running ones - without trouble by
> > helping the scheduler share the CPU among them.  This requires that they
> > can sleep of course.
> > 
> > I haven't seen the actually symptoms that resulted in the various
> > changes to this code, but that last point is I think the key one.  The
> > work item, like all kernel code, needs to have scheduler points if it
> > might run for a longish time.  list_lru_walk() doesn't contain any
> > scheduling points so giving a large "nr_to_walk" is always a bad idea.
> 
> That might be a overstated... I don't see other consumers that are so
> concerned about rescheduling. But then it's not clear if they are
> dealing with lengthy LRU lists.

There are very few callers of list_lru_walk().
shrink_dcache_sb() is the closest analogue and it uses batches of 1024.
xfs_buftarg_drain() uses batches of LONG_MAX !!!

Mostly the lru is walked in a shrinker callback, and the shrinker
manages the batch size.
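
A scan callback built on that pattern looks roughly like the sketch
below; the shrinker core bounds the work through sc->nr_to_scan (the
nfsd_file_* names here are assumed for illustration):

static unsigned long
nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
{
        LIST_HEAD(dispose);
        unsigned long ret;

        /* The shrinker core sets sc->nr_to_scan, so each call walks
         * only a bounded slice of the LRU. */
        ret = list_lru_shrink_walk(&nfsd_file_lru, sc,
                                   nfsd_file_lru_cb, &dispose);
        nfsd_file_dispose_list_delayed(&dispose);
        return ret;
}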

Thanks,
NeilBrown

> 
> 
> >> We could do this by adding a cond_resched() call site in the loop,
> >> or take Neil's suggestion of breaking up the free list across multiple
> >> work items that handle one or just a few file releases each.
> > 
> > I guess I should send a proper patch.
> 
> More comments in reply to that patch. Generally speaking, I'm
> comfortable chopping up the work as you propose, we just have to
> address some details.
> 
> 
> -- 
> Chuck Lever
>

Patch

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index a1cdba42c4fa..91a535c2dede 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -112,7 +112,7 @@ static void
 nfsd_file_schedule_laundrette(void)
 {
 	if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags))
-		queue_delayed_work(system_unbound_wq, &nfsd_filecache_laundrette,
+		queue_delayed_work(system_long_wq, &nfsd_filecache_laundrette,
 				   NFSD_LAUNDRETTE_DELAY);
 }