
[v2,0/9] SUNRPC service thread scheduler optimizations

Message ID 168842897573.139194.15893960758088950748.stgit@manet.1015granger.net (mailing list archive)

Message

Chuck Lever July 4, 2023, 12:07 a.m. UTC
Walking a linked list to find an idle thread is not CPU cache-
friendly, and in fact I've noted palpable per-request latency
impacts as the number of nfsd threads on the server increases.

After discussing some possible improvements with Jeff at LSF/MM,
I've been experimenting with the following series. I've measured an
order of magnitude latency improvement in the thread lookup time,
and have managed to keep the whole thing lockless.

After some offline discussion with Neil, I tried out using just
the xarray plus its spinlock to mark threads idle or busy. This
worked as well as the bitmap for lower thread counts, but got
predictably bad as the thread count went past several hundred. For
the moment I'm sticking with the wait-free bitmap lookup.

Also, the maximum thread count is now 4096. I'm still willing to try
an RCU-based bitmap resizing mechanism if we believe this is still
too small a maximum.
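
For illustration only, the lookup I have in mind is roughly the sketch
below. The names sp_idle_map, sp_thread_xa, and sp_nrthreads are
placeholders rather than the exact identifiers in the patches, and
whether a set bit means "idle" or "busy" is an implementation detail:

static struct svc_rqst *
svc_pool_wake_idle_thread(struct svc_pool *pool)
{
        struct svc_rqst *rqstp;
        unsigned long bit;

        /* Scan the per-pool bitmap without taking sp_lock. */
        for_each_set_bit(bit, pool->sp_idle_map, pool->sp_nrthreads) {
                /* Claim the thread atomically; skip it if another CPU won. */
                if (!test_and_clear_bit(bit, pool->sp_idle_map))
                        continue;

                /* The xarray maps thread IDs to svc_rqst structures. */
                rqstp = xa_load(&pool->sp_thread_xa, bit);
                if (!rqstp)
                        continue;

                wake_up_process(rqstp->rq_task);
                return rqstp;
        }
        return NULL;            /* every thread is busy */
}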


Changes since RFC:
* Add a counter for ingress RPC messages
* Add a few documenting comments
* Move the more controversial patches to the end of the series
* Clarify the refactoring of svc_wake_up() 
* Increase the value of RPCSVC_MAXPOOLTHREADS to 4096 (and tested with that many threads)
* Optimize the loop in svc_pool_wake_idle_thread()
* Optimize marking a thread "idle" in svc_get_next_xprt()

---

Chuck Lever (9):
      SUNRPC: Deduplicate thread wake-up code
      SUNRPC: Report when no service thread is available.
      SUNRPC: Split the svc_xprt_dequeue tracepoint
      SUNRPC: Count ingress RPC messages per svc_pool
      SUNRPC: Clean up svc_set_num_threads
      SUNRPC: Replace dprintk() call site in __svc_create()
      SUNRPC: Don't disable BH's when taking sp_lock
      SUNRPC: Replace sp_threads_all with an xarray
      SUNRPC: Convert RQ_BUSY into a per-pool bitmap


 fs/nfsd/nfssvc.c              |   3 +-
 include/linux/sunrpc/svc.h    |  18 ++--
 include/trace/events/sunrpc.h | 159 +++++++++++++++++++++++++++----
 net/sunrpc/svc.c              | 174 ++++++++++++++++++++++------------
 net/sunrpc/svc_xprt.c         | 114 +++++++++++-----------
 5 files changed, 328 insertions(+), 140 deletions(-)

--
Chuck Lever

Comments

NeilBrown July 5, 2023, 1:08 a.m. UTC | #1
I've been pondering this scheduling mechanism in sunrpc/svc some more,
and I wonder if rather than optimising the search, we should eliminate
it.

Instead we could have a linked list of idle threads using llist.h

svc_enqueue_xprt calls llist_del_first() and if the result is not NULL,
that thread is deemed busy (because it isn't on the list) and is woken.

choose_victim() could also use llist_del_first().  If nothing is there
it could set a flag which gets cleared by the next thread to go idle.
That thread exits ..  or something.  Some interlock would be needed but
it shouldn't be too hard.
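
In rough code (rq_idle in svc_rqst and sp_idle_threads in svc_pool are
names I've made up for the sketch, not anything that exists yet):

/* A thread parks itself on the idle list before sleeping. */
static void svc_thread_go_idle(struct svc_pool *pool, struct svc_rqst *rqstp)
{
        llist_add(&rqstp->rq_idle, &pool->sp_idle_threads);
}

/*
 * Pop one idle thread; NULL means every thread is already busy.
 * Per the table at the top of llist.h, concurrent llist_del_first()
 * callers need serialising, so this would run under whatever already
 * serialises the enqueue path.
 */
static struct svc_rqst *svc_pool_claim_idle(struct svc_pool *pool)
{
        struct llist_node *node = llist_del_first(&pool->sp_idle_threads);

        return node ? llist_entry(node, struct svc_rqst, rq_idle) : NULL;
}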

svc_exit_thread would have difficulty removing itself from the idle
list if it wasn't busy. Possibly we could disallow that case (I think
sending a signal to a thread can make it exit while idle).
Alternately we could use llist_del_all() to clear the list, then wake
them all up so that they go back on the list if there is nothing to do
and if they aren't trying to exit.  That is fairly heavy handed, but
isn't a case that we need to optimise.
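
Again with made-up names, the heavy-handed version is just:

/* Empty the idle list and wake everyone; a thread that wants to exit
 * can then do so, and the rest put themselves back on the list.
 */
static void svc_pool_flush_idle(struct svc_pool *pool)
{
        struct llist_node *list = llist_del_all(&pool->sp_idle_threads);
        struct svc_rqst *rqstp;

        llist_for_each_entry(rqstp, list, rq_idle)
                wake_up_process(rqstp->rq_task);
}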

If you think that might be worth pursuing, I could have a go at writing
the patch - probably on top of all the cleanups in your series before
the xarray is added.

I also wonder if we should avoid waking too many threads up at once.
If multiple events happen in quick succession, we currently wake up
multiple threads and if there is any scheduling delay (which is expected
based on Commit 22700f3c6df5 ("SUNRPC: Improve ordering of transport processing"))
then by the time the threads wake up, there may no longer be work to do
as another thread might have gone idle and taken the work.

Instead we could have a limit on the number of threads waking up -
possibly 1 or 3.  If the counter is maxed out, don't do a wake up.
When a thread wakes up, it decrements the counter, dequeues some work,
and if there is more to do, then it queues another task.
I imagine the same basic protocol would be used for creating new threads
when load is high - start just one at a time, though maybe a new thread
would handle a first request before possibly starting another thread.
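
Something like the following, where sp_pending_wakeups is an invented
field and 3 is an arbitrary cap:

#define SVC_MAX_PENDING_WAKEUPS         3

/* Enqueue path: only wake a thread if fewer than the cap are already
 * on their way to picking up work.
 */
static bool svc_pool_want_wakeup(struct svc_pool *pool)
{
        if (atomic_inc_return(&pool->sp_pending_wakeups) <=
            SVC_MAX_PENDING_WAKEUPS)
                return true;
        atomic_dec(&pool->sp_pending_wakeups);
        return false;
}

/* Called by a thread once it has woken up and dequeued some work;
 * if more work remains, it passes the wake-up along.
 */
static void svc_pool_wakeup_done(struct svc_pool *pool, bool more_work)
{
        atomic_dec(&pool->sp_pending_wakeups);
        if (more_work && svc_pool_want_wakeup(pool))
                svc_pool_wake_idle_thread(pool);
}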

But this is a stretch goal - not the main focus.

Thanks,
NeilBrown
Chuck Lever July 5, 2023, 2:43 a.m. UTC | #2
On Wed, Jul 05, 2023 at 11:08:32AM +1000, NeilBrown wrote:
> 
> I've been pondering this scheduling mechanism in sunrpc/svc some more,
> and I wonder if rather than optimising the search, we should eliminate
> it.
> 
> Instead we could have a linked list of idle threads using llist.h
> 
> svc_enqueue_xprt calls llist_del_first() and if the result is not NULL,
> that thread is deemed busy (because it isn't on the list) and is woken.
> 
> choose_victim() could also use llist_del_first().  If nothing is there
> it could set a flag which gets cleared by the next thread to go idle.
> That thread exits ..  or something.  Some interlock would be needed but
> it shouldn't be too hard.
> 
> svc_exit_thread would have difficulty removing itself from the idle
> list if it wasn't busy. Possibly we could disallow that case (I think
> sending a signal to a thread can make it exit while idle).
> Alternately we could use llist_del_all() to clear the list, then wake
> them all up so that they go back on the list if there is nothing to do
> and if they aren't trying to exit.  That is fairly heavy handed, but
> isn't a case that we need to optimise.
> 
> If you think that might be worth pursuing, I could have a go at writing
> the patch - probably on top of all the cleanups in your series before
> the xarray is added.

The thread pool is effectively a cached resource, so it is a use case
that fits llist well. svcrdma uses llist in a few spots in that very
capacity.

If you think you can meet all of the criteria in the table at the top of
llist.h so thread scheduling works entirely without a lock, that might
be an interesting point of comparison.

My only concern is that the current set of candidate mechanisms manage
to use mostly the first thread, rather than round-robining through the
thread list. Using mostly one process tends to be more cache-friendly.
An llist-based thread scheduler should try to follow that behavior,
IMO.


> I also wonder if we should avoid waking too many threads up at once.
> If multiple events happen in quick succession, we currently wake up
> multiple threads and if there is any scheduling delay (which is expected
> based on Commit 22700f3c6df5 ("SUNRPC: Improve ordering of transport processing"))
> then by the time the threads wake up, there may no longer be work to do
> as another thread might have gone idle and taken the work.

It might be valuable to add some observability of wake-ups that find
nothing to do. I'll look into that.
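
Something along these lines, perhaps (the tracepoint name and fields
are just a sketch):

TRACE_EVENT(svc_pool_awoken_empty,
        TP_PROTO(const struct svc_rqst *rqstp),
        TP_ARGS(rqstp),
        TP_STRUCT__entry(
                __field(unsigned int, pool_id)
                __string(service, rqstp->rq_server->sv_name)
        ),
        TP_fast_assign(
                __entry->pool_id = rqstp->rq_pool->sp_id;
                __assign_str(service, rqstp->rq_server->sv_name);
        ),
        TP_printk("service=%s pool=%u woken but found no work",
                __get_str(service), __entry->pool_id)
);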


> Instead we could have a limit on the number of threads waking up -
> possibly 1 or 3.  If the counter is maxed out, don't do a wake up.
> When a thread wakes up, it decrements the counter, dequeues some work,
> and if there is more to do, then it queues another task.

I consider reducing the wake-up rate to be the next step for improving
RPC service thread scalability. Any experimentation in that area is
worth looking into.


> I imagine the same basic protocol would be used for creating new threads
> when load is high - start just one at a time, though maybe a new thread
> would handle a first request before possibly starting another thread.

I envision dynamically tuning the pool thread count as something
that should be managed in user space, since it's a policy rather than
a mechanism.

That could be a problem, though, if we wanted to shut down a few pool
threads based on shrinker activity.