
[RFC,0/8] RPC service thread scheduler optimizations

Message ID: 168806401782.1034990.9686296943273298604.stgit@morisot.1015granger.net

Message

Chuck Lever June 29, 2023, 6:42 p.m. UTC
Hi -

Walking a linked list to find an idle thread is not CPU cache-
friendly, and in fact I've noted palpable per-request latency
impacts as the number of nfsd threads on the server increases.

After discussing some possible improvements with Jeff at LSF/MM,
I've been experimenting with the following series. I've measured an
order of magnitude latency improvement in the thread lookup time,
and have managed to keep the whole thing lockless.

The only thing I don't like is that allocating the idle bitmaps in
advance means we've got an /a priori/ cap on the number of NFSD
threads that can be created. I'd love to find a way to enable
the pool idle bitmaps to expand dynamically. Suggestions welcome.
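
For context, the overall shape is roughly the following. This is only
a sketch; the field and constant names are illustrative and not
necessarily what the patches end up using:

/*
 * Sketch: each pool keeps an xarray of its threads, indexed by a
 * small per-pool thread ID, plus a pre-allocated bitmap with one bit
 * per possible ID. Finding an idle thread becomes a find_first_bit()
 * over that bitmap instead of a list walk under sp_lock.
 */
#define RPCSVC_MAXPOOLTHREADS	(4096)	/* the a priori cap */

struct svc_pool {
	unsigned int	sp_id;		/* pool id */
	struct xarray	sp_thread_xa;	/* thread ID -> svc_rqst */
	unsigned long	*sp_idle_map;	/* one bit per thread ID */
	/* ... remaining fields unchanged ... */
};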

Cc: Lorenzo - I'm not sure whether this work would be applied first
or whether your rpc_status patch would be. If yours goes first, I
can handle converting nfsd_rpc_status() to use the pool's thread
xarray.

---

Chuck Lever (8):
      SUNRPC: Deduplicate thread wake-up code
      SUNRPC: Report when no service thread is available.
      SUNRPC: Split the svc_xprt_dequeue tracepoint
      SUNRPC: Clean up svc_set_num_threads
      SUNRPC: Replace dprintk() call site in __svc_create()
      SUNRPC: Replace sp_threads_all with an xarray
      SUNRPC: Convert RQ_BUSY into a per-pool bitmap
      SUNRPC: Don't disable BH's when taking sp_lock


 fs/nfsd/nfssvc.c              |   3 +-
 include/linux/sunrpc/svc.h    |  17 ++--
 include/trace/events/sunrpc.h | 160 ++++++++++++++++++++++++++++----
 net/sunrpc/svc.c              | 169 ++++++++++++++++++++++------------
 net/sunrpc/svc_xprt.c         | 103 +++++++++++----------
 5 files changed, 316 insertions(+), 136 deletions(-)

--
Chuck Lever

Comments

NeilBrown June 30, 2023, 2:42 a.m. UTC | #1
On Fri, 30 Jun 2023, Chuck Lever wrote:
> Hi -
> 
> Walking a linked list to find an idle thread is not CPU cache-
> friendly, and in fact I've noted palpable per-request latency
> impacts as the number of nfsd threads on the server increases.
> 
> After discussing some possible improvements with Jeff at LSF/MM,
> I've been experimenting with the following series. I've measured an
> order of magnitude latency improvement in the thread lookup time,
> and have managed to keep the whole thing lockless.
> 
> The only thing I don't like is that allocating the idle bitmaps in
> advance means we've got an /a priori/ cap on the number of NFSD
> threads that can be created. I'd love to find a way to enable
> the pool idle bitmaps to expand dynamically. Suggestions welcome.

Hi Chuck,
 The series looks good.  I did notice that patch 6/8 used UINT_MAX and
 U32_MAX in different places for the same number, though the next patch
 replaced them both with a new number - the same in both places now.

 I agree that an a priori cap on number of threads is not ideal.
 Have you considered using the xarray to only store busy threads?
 I think its lookup mechanism mostly relies on a bitmap of present
 entries, but I'm not completely sure.

 That would require some extra work for svc_stop_threads(), which is the
 only place we are interested in threads that aren't busy.  We would
 need to record a target number of threads, and whenever a thread
 becomes idle it would check whether that target is exceeded.  If so, it
 exits, decrementing the number of threads; otherwise it re-inserts
 itself into the xarray (if it cannot find a transport to handle).

 Alternately we could store bitmaps as values in an xarray, much like
 the ida code does.
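
 Something along those lines, purely as an untested sketch (all of the
 names below are invented for illustration):

/*
 * Store BITS_PER_LONG-sized bitmap chunks as xarray values, so the
 * idle map can grow on demand instead of being sized up front.
 */
static int svc_pool_set_idle(struct xarray *idle_xa, unsigned long id)
{
	unsigned long index = id / BITS_PER_LONG;
	unsigned long *chunk, *new_chunk;

	chunk = xa_load(idle_xa, index);
	if (!chunk) {
		new_chunk = kzalloc(sizeof(*new_chunk), GFP_KERNEL);
		if (!new_chunk)
			return -ENOMEM;
		/* Install the new chunk unless another thread raced us. */
		chunk = xa_cmpxchg(idle_xa, index, NULL, new_chunk, GFP_KERNEL);
		if (xa_is_err(chunk)) {
			kfree(new_chunk);
			return xa_err(chunk);
		}
		if (chunk)
			kfree(new_chunk);	/* lost the race; use the winner's */
		else
			chunk = new_chunk;
	}
	set_bit(id % BITS_PER_LONG, chunk);
	return 0;
}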

Thanks,
NeilBrown
NeilBrown June 30, 2023, 3:04 a.m. UTC | #2
On Fri, 30 Jun 2023, NeilBrown wrote:
> 
>  I agree that an a priori cap on number of threads is not ideal.
>  Have you considered using the xarray to only store busy threads?
>  I think its lookup mechanism mostly relies on a bitmap of present
>  entries, but I'm not completely sure.

It might be even better to use xa_set_mark() and xa_clear_mark() to
manage the busy state.
These are not atomic, so you would need an atomic operation in the rqst.

#define XA_MARK_IDLE XA_MARK_1

do {
	rqstp = xa_find(xa, ..., XA_MARK_IDLE);
	if (rqstp && !test_and_set_bit(RQ_BUSY, &rqstp->rq_flags)) {
		xa_clear_mark(xa, rqstp->xa_index, XA_MARK_IDLE);
		break;
	}
} while (rqstp);

xa_find() should be nearly as fast as find_next_bit().
  
NeilBrown
Chuck Lever June 30, 2023, 3:42 a.m. UTC | #3
> On Jun 29, 2023, at 11:04 PM, NeilBrown <neilb@suse.de> wrote:
> 
> On Fri, 30 Jun 2023, NeilBrown wrote:
>> 
>> I agree that an a priori cap on number of threads is not ideal.
>> Have you considered using the xarray to only store busy threads?
>> I think its lookup mechanism mostly relies on a bitmap of present
>> entries, but I'm not completely sure.
> 
> It might be even better to use xa_set_mark() and xa_clear_mark() to
> manage the busy state.
> These are not atomic, so you would need an atomic operation in the rqst.
> 
> #define XA_MARK_IDLE XA_MARK_1
> 
> do {
> 	rqstp = xa_find(xa, ..., XA_MARK_IDLE);
> 	if (rqstp && !test_and_set_bit(RQ_BUSY, &rqstp->rq_flags)) {
> 		xa_clear_mark(xa, rqstp->xa_index, XA_MARK_IDLE);
> 		break;
> 	}
> } while (rqstp);
> 
> xa_find() should be nearly as fast as find_next_bit().

Yes, xa_find() is fast, and takes only the RCU read lock. Unfortunately,
xa_clear_mark and xa_set_mark (and adding and removing items) require
taking the xa_lock().

The current list-based mechanism is lockless, and so is the replacement
I've proposed. Jeff and I regarded the addition of a spin lock in the
select-an-idle-thread-and-wake-it-up function as something we want to
avoid.

Thus the pool's idle bitmap enables the use of atomic bit testing,
setting, and clearing to make it unnecessary to hold a spin lock while
selecting and marking the chosen thread. After selection, only the RCU
read lock is needed to safely map that bit position to an svc_rqst
pointer.
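
The selection path then has roughly this shape (a simplified sketch,
not the exact code in the series; the field names are illustrative):

static struct svc_rqst *svc_pool_find_idle_thread(struct svc_pool *pool)
{
	struct svc_rqst *rqstp = NULL;
	unsigned long bit;

	rcu_read_lock();
	for (;;) {
		bit = find_first_bit(pool->sp_idle_map, pool->sp_nrthreads);
		if (bit >= pool->sp_nrthreads)
			break;		/* every thread is busy */
		/* Atomic claim of the idle thread; no sp_lock needed. */
		if (test_and_clear_bit(bit, pool->sp_idle_map)) {
			/* Only the RCU read lock protects this lookup. */
			rqstp = xa_load(&pool->sp_thread_xa, bit);
			break;
		}
	}
	rcu_read_unlock();
	return rqstp;
}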

We might be able to use another sp_flags bit to cause thread selection
to sleep, if we also protect the thread selection process with the RCU
read lock. Flip the bit on, wait one RCU grace period, and it should be
safe to copy and replace the idle bitmap. Once bitmap replacement is
complete, clear the bit and wake up any threads waiting to find an idle
worker.
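
Handwaving, that might look something like the following. This is
purely hypothetical: none of these names exist today, and sp_idle_map
would have to become an RCU-managed pointer:

static int svc_pool_grow_idle_map(struct svc_pool *pool, unsigned int new_bits)
{
	unsigned long *new_map, *old_map;

	new_map = bitmap_zalloc(new_bits, GFP_KERNEL);
	if (!new_map)
		return -ENOMEM;

	/* Park the thread-selection path ... */
	set_bit(SP_IDLE_MAP_FROZEN, &pool->sp_flags);
	/* ... and wait for selections already in flight to finish. */
	synchronize_rcu();

	old_map = rcu_dereference_protected(pool->sp_idle_map, true);
	bitmap_copy(new_map, old_map, pool->sp_idle_bits);
	rcu_assign_pointer(pool->sp_idle_map, new_map);
	pool->sp_idle_bits = new_bits;

	clear_bit(SP_IDLE_MAP_FROZEN, &pool->sp_flags);
	wake_up(&pool->sp_wait);	/* waiting wake-up callers retry */

	/* No selector can still hold the old map after the grace period. */
	bitmap_free(old_map);
	return 0;
}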

The balancing act is whether we want to add that kind of cleverness to
handle a situation that will be relatively rare. Is it sensible to
allow more than a few thousand RPC service threads in a pool?


--
Chuck Lever