Message ID | 20250103010002.619062-2-cel@kernel.org
---|---
State | Changes Requested
Series | [v1,1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages"
On Thu, 2025-01-02 at 20:00 -0500, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Youzhong Yang <youzhong@gmail.com> noticed the workqueue subsystem complaining about how long the filecache laundrette was running. This resulted in switching the laundrette from system_wq to system_unbound_wq (see commit 4b84551a35e3 ("nfsd: use system_unbound_wq for nfsd_file_gc_worker()")).
>
> However, I've seen the laundrette running for multiple milliseconds on some workloads, delaying other work. For the purpose of scheduling fairness, perhaps a better choice would be to process filecache disposal queues on system_long_wq instead.
>
> Cc: Tejun Heo <tj@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/filecache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index a1cdba42c4fa..91a535c2dede 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -112,7 +112,7 @@ static void
>  nfsd_file_schedule_laundrette(void)
>  {
>  	if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags))
> -		queue_delayed_work(system_unbound_wq, &nfsd_filecache_laundrette,
> +		queue_delayed_work(system_long_wq, &nfsd_filecache_laundrette,
>  				NFSD_LAUNDRETTE_DELAY);
>  }

TIL that there is system_long_wq! Seems like a reasonable thing to do, since this is generally low-priority work.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
On Fri, 03 Jan 2025, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Youzhong Yang <youzhong@gmail.com> noticed the workqueue subsystem complaining about how long the filecache laundrette was running. This resulted in switching the laundrette from system_wq to system_unbound_wq (see commit 4b84551a35e3 ("nfsd: use system_unbound_wq for nfsd_file_gc_worker()")).
>
> However, I've seen the laundrette running for multiple milliseconds on some workloads, delaying other work. For the purpose of scheduling fairness, perhaps a better choice would be to process filecache disposal queues on system_long_wq instead.
>
> Cc: Tejun Heo <tj@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/filecache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index a1cdba42c4fa..91a535c2dede 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -112,7 +112,7 @@ static void
>  nfsd_file_schedule_laundrette(void)
>  {
>  	if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags))
> -		queue_delayed_work(system_unbound_wq, &nfsd_filecache_laundrette,
> +		queue_delayed_work(system_long_wq, &nfsd_filecache_laundrette,
>  				NFSD_LAUNDRETTE_DELAY);

This doesn't seem right to me.

system_unbound_wq is not bound to a CPU. The work item just gets a thread allocated to it, which shouldn't interfere with any other work (except for normal scheduling contention) until the queue limit is reached, which is unlikely to happen often.

system_long_wq IS bound to a CPU and is concurrency-managed, so it could interfere with other things. The idea is that work items scheduled there are not in a hurry and could take longer. But I don't think they should take very long...

I think the problem is elsewhere. list_lru_walk() can hold a spin lock for as long as it takes to walk nr_to_walk objects. That will tie up the CPU no matter which work queue it is running on.

I think that instead of passing "list_lru_count()" we should pass some constant like 1024:

	cnt = list_lru_count();
	while (cnt > 0) {
		num = min(cnt, 1024);
		list_lru_walk(...., num);
		cond_resched();
		cnt -= num;
	}

Then run it from system_wq.

An lru is most often walked via list_lru_shrink_walk() from a shrinker, and the pattern there is essentially the one above: a count is taken, possibly scaled down, then the shrinker is called in batches.

NeilBrown
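Neil's batching pattern maps fairly directly onto the existing filecache garbage collector. Below is a minimal sketch of what it could look like, assuming the helpers already present in fs/nfsd/filecache.c (nfsd_file_lru, nfsd_file_lru_cb, nfsd_file_dispose_list_delayed); it illustrates the shape of the change, not a finished patch:

	/*
	 * Sketch only: walk the filecache LRU in fixed-size batches with a
	 * scheduling point between batches, instead of handing
	 * list_lru_walk() the full list_lru_count() in one call.
	 */
	static void
	nfsd_file_gc(void)
	{
		unsigned long remaining = list_lru_count(&nfsd_file_lru);

		while (remaining > 0) {
			unsigned long num = min(remaining, 1024UL);
			LIST_HEAD(dispose);

			/* Isolate at most 'num' entries onto a private list. */
			list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
				      &dispose, num);
			nfsd_file_dispose_list_delayed(&dispose);
			remaining -= num;

			/* Give up the CPU between batches. */
			cond_resched();
		}
	}

With this shape, the laundrette yields the CPU after every batch regardless of which workqueue it runs on, which is the behaviour Neil is after.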
On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
> On Fri, 03 Jan 2025, cel@kernel.org wrote:
...
> I think that instead of passing "list_lru_count()" we should pass some constant like 1024:
>
>	cnt = list_lru_count();
>	while (cnt > 0) {
>		num = min(cnt, 1024);
>		list_lru_walk(...., num);
>		cond_resched();
>		cnt -= num;
>	}
>
> Then run it from system_wq.
>
> An lru is most often walked via list_lru_shrink_walk() from a shrinker, and the pattern there is essentially the one above: a count is taken, possibly scaled down, then the shrinker is called in batches.

BTW, there's nothing wrong with taking some msecs or even tens of msecs running on system_unbound_wq, so the current state may be fine too.

Thanks.
On 1/3/25 5:53 PM, Tejun Heo wrote:
> On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
>> On Fri, 03 Jan 2025, cel@kernel.org wrote:
> ...
>> I think that instead of passing "list_lru_count()" we should pass some constant like 1024:
>>
>>	cnt = list_lru_count();
>>	while (cnt > 0) {
>>		num = min(cnt, 1024);
>>		list_lru_walk(...., num);
>>		cond_resched();
>>		cnt -= num;
>>	}
>>
>> Then run it from system_wq.
>>
>> An lru is most often walked via list_lru_shrink_walk() from a shrinker, and the pattern there is essentially the one above: a count is taken, possibly scaled down, then the shrinker is called in batches.
>
> BTW, there's nothing wrong with taking some msecs or even tens of msecs running on system_unbound_wq, so the current state may be fine too.

My thinking was that this work is low priority, so there should be plenty of opportunity to set it aside for a few moments and handle higher-priority work. Maybe not worrisome on systems with a high core count, but on small SMP systems (e.g. VM guests), I've found that tasks like this can be rude neighbors.

We could do this by adding a cond_resched() call site in the loop, or take Neil's suggestion of breaking up the free list across multiple work items that handle one or just a few file releases each.
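The second alternative, many small work items instead of one long-running laundrette, could look roughly like the sketch below. The nfsd_file_dispose_batch structure and its two helpers are hypothetical names invented for illustration; nfsd_file_dispose_list() is the existing helper in fs/nfsd/filecache.c:

	#include <linux/list.h>
	#include <linux/slab.h>
	#include <linux/workqueue.h>

	/* Hypothetical: one short-lived work item per small batch of releases. */
	struct nfsd_file_dispose_batch {
		struct work_struct	work;
		struct list_head	entries;	/* a few nfsd_file objects */
	};

	static void
	nfsd_file_dispose_batch_work(struct work_struct *work)
	{
		struct nfsd_file_dispose_batch *batch =
			container_of(work, struct nfsd_file_dispose_batch, work);

		/* Release only this batch, then let the worker thread go. */
		nfsd_file_dispose_list(&batch->entries);
		kfree(batch);
	}

	static void
	nfsd_file_queue_batch(struct nfsd_file_dispose_batch *batch)
	{
		/*
		 * The caller has allocated 'batch' and moved a handful of
		 * entries onto batch->entries; each batch then runs as an
		 * independent, short work item.
		 */
		INIT_WORK(&batch->work, nfsd_file_dispose_batch_work);
		queue_work(system_wq, &batch->work);
	}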
On Sun, 05 Jan 2025, Chuck Lever wrote:
> On 1/3/25 5:53 PM, Tejun Heo wrote:
>> On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
>>> On Fri, 03 Jan 2025, cel@kernel.org wrote:
>> ...
>>> I think that instead of passing "list_lru_count()" we should pass some constant like 1024:
>>>
>>>	cnt = list_lru_count();
>>>	while (cnt > 0) {
>>>		num = min(cnt, 1024);
>>>		list_lru_walk(...., num);
>>>		cond_resched();
>>>		cnt -= num;
>>>	}
>>>
>>> Then run it from system_wq.
>>>
>>> An lru is most often walked via list_lru_shrink_walk() from a shrinker, and the pattern there is essentially the one above: a count is taken, possibly scaled down, then the shrinker is called in batches.
>>
>> BTW, there's nothing wrong with taking some msecs or even tens of msecs running on system_unbound_wq, so the current state may be fine too.
>
> My thinking was that this work is low priority, so there should be plenty of opportunity to set it aside for a few moments and handle higher-priority work. Maybe not worrisome on systems with a high core count, but on small SMP systems (e.g. VM guests), I've found that tasks like this can be rude neighbors.

None of the different work queues are expected to "set aside" the work for more than a normal scheduling time slice.

system_long_wq was created "so that flush_scheduled_work() doesn't take so long", but flush_scheduled_work() is never called now (and would generate a warning if it was), so system_long_wq should really be removed.

system_unbound_wq uses threads that are not bound to a CPU, so the scheduler can move them around. That is most suitable for something that might run for a long time.

system_wq is bound to the CPU that schedules the work, but it can run multiple work items - potentially long-running ones - without trouble, by helping the scheduler share the CPU among them. This requires that they can sleep, of course.

I haven't seen the actual symptoms that resulted in the various changes to this code, but that last point is, I think, the key one. The work item, like all kernel code, needs to have scheduling points if it might run for a longish time. list_lru_walk() doesn't contain any scheduling points, so giving it a large "nr_to_walk" is always a bad idea.

> We could do this by adding a cond_resched() call site in the loop, or take Neil's suggestion of breaking up the free list across multiple work items that handle one or just a few file releases each.

I guess I should send a proper patch.

NeilBrown

> --
> Chuck Lever
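For reference, the queues being compared here are ordinary global workqueue pointers declared in include/linux/workqueue.h; the patch under discussion only changes which one is passed to queue_delayed_work():

	extern struct workqueue_struct *system_wq;		/* per-CPU, concurrency-managed */
	extern struct workqueue_struct *system_long_wq;	/* per-CPU, intended for longer-running items */
	extern struct workqueue_struct *system_unbound_wq;	/* not bound to any particular CPU */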
On 1/5/25 5:28 PM, NeilBrown wrote:
> On Sun, 05 Jan 2025, Chuck Lever wrote:
>> On 1/3/25 5:53 PM, Tejun Heo wrote:
>>> On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
>>>> On Fri, 03 Jan 2025, cel@kernel.org wrote:
>>> ...
>>>> I think that instead of passing "list_lru_count()" we should pass some constant like 1024:
>>>>
>>>>	cnt = list_lru_count();
>>>>	while (cnt > 0) {
>>>>		num = min(cnt, 1024);
>>>>		list_lru_walk(...., num);
>>>>		cond_resched();
>>>>		cnt -= num;
>>>>	}
>>>>
>>>> Then run it from system_wq.
>>>>
>>>> An lru is most often walked via list_lru_shrink_walk() from a shrinker, and the pattern there is essentially the one above: a count is taken, possibly scaled down, then the shrinker is called in batches.
>>>
>>> BTW, there's nothing wrong with taking some msecs or even tens of msecs running on system_unbound_wq, so the current state may be fine too.
>>
>> My thinking was that this work is low priority, so there should be plenty of opportunity to set it aside for a few moments and handle higher-priority work. Maybe not worrisome on systems with a high core count, but on small SMP systems (e.g. VM guests), I've found that tasks like this can be rude neighbors.
>
> None of the different work queues are expected to "set aside" the work for more than a normal scheduling time slice.
>
> system_long_wq was created "so that flush_scheduled_work() doesn't take so long", but flush_scheduled_work() is never called now (and would generate a warning if it was), so system_long_wq should really be removed.
>
> system_unbound_wq uses threads that are not bound to a CPU, so the scheduler can move them around. That is most suitable for something that might run for a long time.

That's not my understanding: unbound workqueues select a CPU to run the work item on, and it can be (and usually is) not the same CPU as the one that invoked queue_work(). Then it isn't rescheduled. The work items are expected to complete quickly; work item termination is the typical reschedule point. My understanding, as always, could be stale.

But that's neither here nor there: I've dropped the patch to replace system_unbound_wq with system_long_wq.

> system_wq is bound to the CPU that schedules the work, but it can run multiple work items - potentially long-running ones - without trouble, by helping the scheduler share the CPU among them. This requires that they can sleep, of course.
>
> I haven't seen the actual symptoms that resulted in the various changes to this code, but that last point is, I think, the key one. The work item, like all kernel code, needs to have scheduling points if it might run for a longish time. list_lru_walk() doesn't contain any scheduling points, so giving it a large "nr_to_walk" is always a bad idea.

That might be overstated... I don't see other consumers that are so concerned about rescheduling. But then it's not clear whether they are dealing with lengthy LRU lists.

>> We could do this by adding a cond_resched() call site in the loop, or take Neil's suggestion of breaking up the free list across multiple work items that handle one or just a few file releases each.
>
> I guess I should send a proper patch.

More comments in reply to that patch. Generally speaking, I'm comfortable chopping up the work as you propose; we just have to address some details.
On Mon, 06 Jan 2025, Chuck Lever wrote:
> On 1/5/25 5:28 PM, NeilBrown wrote:
>> On Sun, 05 Jan 2025, Chuck Lever wrote:
>>> On 1/3/25 5:53 PM, Tejun Heo wrote:
>>>> On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
>>>>> On Fri, 03 Jan 2025, cel@kernel.org wrote:
>>>> ...
>>>>> I think that instead of passing "list_lru_count()" we should pass some constant like 1024:
>>>>>
>>>>>	cnt = list_lru_count();
>>>>>	while (cnt > 0) {
>>>>>		num = min(cnt, 1024);
>>>>>		list_lru_walk(...., num);
>>>>>		cond_resched();
>>>>>		cnt -= num;
>>>>>	}
>>>>>
>>>>> Then run it from system_wq.
>>>>>
>>>>> An lru is most often walked via list_lru_shrink_walk() from a shrinker, and the pattern there is essentially the one above: a count is taken, possibly scaled down, then the shrinker is called in batches.
>>>>
>>>> BTW, there's nothing wrong with taking some msecs or even tens of msecs running on system_unbound_wq, so the current state may be fine too.
>>>
>>> My thinking was that this work is low priority, so there should be plenty of opportunity to set it aside for a few moments and handle higher-priority work. Maybe not worrisome on systems with a high core count, but on small SMP systems (e.g. VM guests), I've found that tasks like this can be rude neighbors.
>>
>> None of the different work queues are expected to "set aside" the work for more than a normal scheduling time slice.
>>
>> system_long_wq was created "so that flush_scheduled_work() doesn't take so long", but flush_scheduled_work() is never called now (and would generate a warning if it was), so system_long_wq should really be removed.
>>
>> system_unbound_wq uses threads that are not bound to a CPU, so the scheduler can move them around. That is most suitable for something that might run for a long time.
>
> That's not my understanding: unbound workqueues select a CPU to run the work item on, and it can be (and usually is) not the same CPU as the one that invoked queue_work(). Then it isn't rescheduled. The work items are expected to complete quickly; work item termination is the typical reschedule point. My understanding, as always, could be stale.

If the work item will never schedule, then there isn't much point in using a work queue. The code can be run inline, or with a timer.

> But that's neither here nor there: I've dropped the patch to replace system_unbound_wq with system_long_wq.
>
>> system_wq is bound to the CPU that schedules the work, but it can run multiple work items - potentially long-running ones - without trouble, by helping the scheduler share the CPU among them. This requires that they can sleep, of course.
>>
>> I haven't seen the actual symptoms that resulted in the various changes to this code, but that last point is, I think, the key one. The work item, like all kernel code, needs to have scheduling points if it might run for a longish time. list_lru_walk() doesn't contain any scheduling points, so giving it a large "nr_to_walk" is always a bad idea.
>
> That might be overstated... I don't see other consumers that are so concerned about rescheduling. But then it's not clear whether they are dealing with lengthy LRU lists.

There are very few callers of list_lru_walk(). shrink_dcache_sb() is the closest analogue, and it uses batches of 1024. xfs_buftarg_drain() uses batches of LONG_MAX!!! Mostly the lru is walked in a shrinker callback, and the shrinker manages the batch size.

Thanks,
NeilBrown

>>> We could do this by adding a cond_resched() call site in the loop, or take Neil's suggestion of breaking up the free list across multiple work items that handle one or just a few file releases each.
>>
>> I guess I should send a proper patch.
>
> More comments in reply to that patch. Generally speaking, I'm comfortable chopping up the work as you propose; we just have to address some details.
>
> --
> Chuck Lever
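The shrink_dcache_sb() batching Neil points to looks roughly like the following (paraphrased from fs/dcache.c; check the current tree for the exact form). Each pass isolates at most 1024 dentries under the lru lock and then disposes of them outside it, which keeps any single list_lru_walk() call short:

	void shrink_dcache_sb(struct super_block *sb)
	{
		do {
			LIST_HEAD(dispose);

			/* Pull at most 1024 dentries off the LRU per pass. */
			list_lru_walk(&sb->s_dentry_lru,
				      dentry_lru_isolate_shrink, &dispose, 1024);
			/* Free the isolated dentries outside the lru lock. */
			shrink_dentry_list(&dispose);
		} while (list_lru_count(&sb->s_dentry_lru) > 0);
	}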
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index a1cdba42c4fa..91a535c2dede 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -112,7 +112,7 @@ static void
 nfsd_file_schedule_laundrette(void)
 {
 	if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags))
-		queue_delayed_work(system_unbound_wq, &nfsd_filecache_laundrette,
+		queue_delayed_work(system_long_wq, &nfsd_filecache_laundrette,
 				NFSD_LAUNDRETTE_DELAY);
 }