[v2,9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap

Message ID 168842930872.139194.10164846167275218299.stgit@manet.1015granger.net (mailing list archive)
State New, archived
Series SUNRPC service thread scheduler optimizations

Commit Message

Chuck Lever July 4, 2023, 12:08 a.m. UTC
From: Chuck Lever <chuck.lever@oracle.com>

I've noticed that client-observed server request latency goes up
simply when the nfsd thread count is increased.

List walking is known to be memory-inefficient. On a busy server
with many threads, enqueuing a transport will walk the "all threads"
list quite frequently. This also pulls in the cache lines for some
hot fields in each svc_rqst (namely, rq_flags).

The svc_xprt_enqueue() call that concerns me most is the one in
svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
down completion handling limits the total throughput per RDMA
connection.

So, avoid walking the "all threads" list to find an idle thread to
wake. Instead, set up an idle bitmap and use find_next_bit, which
should work the same way as RQ_BUSY but it will touch only the
cachelines that the bitmap is in. Stick with atomic bit operations
to avoid taking the pool lock.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc.h    |    6 ++++--
 include/trace/events/sunrpc.h |    1 -
 net/sunrpc/svc.c              |   27 +++++++++++++++++++++------
 net/sunrpc/svc_xprt.c         |   30 ++++++++++++++++++++++++------
 4 files changed, 49 insertions(+), 15 deletions(-)

Comments

NeilBrown July 4, 2023, 1:26 a.m. UTC | #1
On Tue, 04 Jul 2023, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> I've noticed that client-observed server request latency goes up
> simply when the nfsd thread count is increased.
> 
> List walking is known to be memory-inefficient. On a busy server
> with many threads, enqueuing a transport will walk the "all threads"
> list quite frequently. This also pulls in the cache lines for some
> hot fields in each svc_rqst (namely, rq_flags).

I think this text could usefully be re-written.  By this point in the
series we aren't list walking.

I'd also be curious to know what latency difference you get for just this
change.

> 
> The svc_xprt_enqueue() call that concerns me most is the one in
> svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
> down completion handling limits the total throughput per RDMA
> connection.
> 
> So, avoid walking the "all threads" list to find an idle thread to
> wake. Instead, set up an idle bitmap and use find_next_bit, which
> should work the same way as RQ_BUSY but it will touch only the
> cachelines that the bitmap is in. Stick with atomic bit operations
> to avoid taking the pool lock.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  include/linux/sunrpc/svc.h    |    6 ++++--
>  include/trace/events/sunrpc.h |    1 -
>  net/sunrpc/svc.c              |   27 +++++++++++++++++++++------
>  net/sunrpc/svc_xprt.c         |   30 ++++++++++++++++++++++++------
>  4 files changed, 49 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 6f8bfcd44250..27ffcf7371d0 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -35,6 +35,7 @@ struct svc_pool {
>  	spinlock_t		sp_lock;	/* protects sp_sockets */
>  	struct list_head	sp_sockets;	/* pending sockets */
>  	unsigned int		sp_nrthreads;	/* # of threads in pool */
> +	unsigned long		*sp_idle_map;	/* idle threads */
>  	struct xarray		sp_thread_xa;
>  
>  	/* statistics on pool operation */
> @@ -190,6 +191,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>  #define RPCSVC_MAXPAGES		((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
>  				+ 2 + 1)
>  
> +#define RPCSVC_MAXPOOLTHREADS	(4096)
> +
>  /*
>   * The context of a single thread, including the request currently being
>   * processed.
> @@ -239,8 +242,7 @@ struct svc_rqst {
>  #define	RQ_SPLICE_OK	(4)			/* turned off in gss privacy
>  						 * to prevent encrypting page
>  						 * cache pages */
> -#define	RQ_BUSY		(5)			/* request is busy */
> -#define	RQ_DATA		(6)			/* request has data */
> +#define	RQ_DATA		(5)			/* request has data */

Might this be a good opportunity to convert this to an enum ??

>  	unsigned long		rq_flags;	/* flags field */
>  	u32			rq_thread_id;	/* xarray index */
>  	ktime_t			rq_qtime;	/* enqueue time */
> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> index ea43c6059bdb..c07824a254bf 100644
> --- a/include/trace/events/sunrpc.h
> +++ b/include/trace/events/sunrpc.h
> @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
>  	svc_rqst_flag(USEDEFERRAL)					\
>  	svc_rqst_flag(DROPME)						\
>  	svc_rqst_flag(SPLICE_OK)					\
> -	svc_rqst_flag(BUSY)						\
>  	svc_rqst_flag_end(DATA)
>  
>  #undef svc_rqst_flag
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index ef350f0d8925..d0278e5190ba 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
>  		INIT_LIST_HEAD(&pool->sp_sockets);
>  		spin_lock_init(&pool->sp_lock);
>  		xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> +		/* All threads initially marked "busy" */
> +		pool->sp_idle_map =
> +			bitmap_zalloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
> +					   svc_pool_map_get_node(i));
> +		if (!pool->sp_idle_map)
> +			return NULL;
>  
>  		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
>  		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> @@ -596,6 +602,8 @@ svc_destroy(struct kref *ref)
>  		percpu_counter_destroy(&pool->sp_threads_starved);
>  
>  		xa_destroy(&pool->sp_thread_xa);
> +		bitmap_free(pool->sp_idle_map);
> +		pool->sp_idle_map = NULL;
>  	}
>  	kfree(serv->sv_pools);
>  	kfree(serv);
> @@ -647,7 +655,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
>  
>  	folio_batch_init(&rqstp->rq_fbatch);
>  
> -	__set_bit(RQ_BUSY, &rqstp->rq_flags);
>  	rqstp->rq_server = serv;
>  	rqstp->rq_pool = pool;
>  
> @@ -677,7 +684,7 @@ static struct svc_rqst *
>  svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>  {
>  	static const struct xa_limit limit = {
> -		.max = U32_MAX,
> +		.max = RPCSVC_MAXPOOLTHREADS,
>  	};
>  	struct svc_rqst	*rqstp;
>  	int ret;
> @@ -722,12 +729,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
>  					   struct svc_pool *pool)
>  {
>  	struct svc_rqst	*rqstp;
> -	unsigned long index;
> +	unsigned long bit;
>  
> -	xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> -		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> +	/* Check the pool's idle bitmap locklessly so that multiple
> +	 * idle searches can proceed concurrently.
> +	 */
> +	for_each_set_bit(bit, pool->sp_idle_map, pool->sp_nrthreads) {
> +		if (!test_and_clear_bit(bit, pool->sp_idle_map))
>  			continue;

I would really rather the map was "sp_busy_map". (initialised with bitmap_fill())
Then you could "test_and_set_bit_lock()" and later "clear_bit_unlock()"
and so get all the required memory barriers.
What we are doing here is locking a particular thread for a task, so
"lock" is an appropriate description of what is happening.
See also svc_pool_thread_mark_* below.
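
For illustration, the search loop above might then look roughly like the
sketch below -- hypothetical code against the suggested sp_busy_map
(filled with bitmap_fill() at init time), with the stats counters trimmed;
this is not code from the posted series:

    struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
                                               struct svc_pool *pool)
    {
            struct svc_rqst *rqstp;
            unsigned long bit;

            /* A clear bit means "not busy", i.e. an idle thread. */
            for_each_clear_bit(bit, pool->sp_busy_map, pool->sp_nrthreads) {
                    /* Acquire the thread: only one waker can win this bit. */
                    if (test_and_set_bit_lock(bit, pool->sp_busy_map))
                            continue;

                    rqstp = xa_load(&pool->sp_thread_xa, bit);
                    if (!rqstp)
                            break;

                    WRITE_ONCE(rqstp->rq_qtime, ktime_get());
                    wake_up_process(rqstp->rq_task);
                    return rqstp;
            }
            return NULL;
    }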

>  
> +		rqstp = xa_load(&pool->sp_thread_xa, bit);
> +		if (!rqstp)
> +			break;
> +
>  		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
>  		wake_up_process(rqstp->rq_task);
>  		percpu_counter_inc(&pool->sp_threads_woken);
> @@ -767,7 +781,8 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
>  	}
>  
>  found_pool:
> -	rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
> +	rqstp = xa_find(&pool->sp_thread_xa, &zero, RPCSVC_MAXPOOLTHREADS,
> +			XA_PRESENT);
>  	if (rqstp) {
>  		__xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
>  		task = rqstp->rq_task;
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 7709120b45c1..2844b32c16ea 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -735,6 +735,25 @@ rqst_should_sleep(struct svc_rqst *rqstp)
>  	return true;
>  }
>  
> +static void svc_pool_thread_mark_idle(struct svc_pool *pool,
> +				      struct svc_rqst *rqstp)
> +{
> +	smp_mb__before_atomic();
> +	set_bit(rqstp->rq_thread_id, pool->sp_idle_map);
> +	smp_mb__after_atomic();
> +}

The memory barriers above and below bother me.  There is no comment
telling me what they are protecting against.
I would rather svc_pool_thread_mark_idle - which unlocks the thread -
were

    clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);

and  that svc_pool_thread_mark_busy were

   test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);

Then it would be more obvious what was happening.
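
Concretely, the two helpers added later in this patch might then collapse
to something like the following (again only a sketch against the suggested
sp_busy_map, not the actual respin):

    static void svc_pool_thread_mark_idle(struct svc_pool *pool,
                                          struct svc_rqst *rqstp)
    {
            /* "unlock" the thread: release semantics, no explicit barriers */
            clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
    }

    static void svc_pool_thread_mark_busy(struct svc_pool *pool,
                                          struct svc_rqst *rqstp)
    {
            /* "lock" the thread: acquire semantics; the bit may already be
             * set if a waker marked us busy, and that is fine.
             */
            test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);
    }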

Thanks,
NeilBrown

> +
> +/*
> + * Note: If we were awoken, then this rqstp has already been marked busy.
> + */
> +static void svc_pool_thread_mark_busy(struct svc_pool *pool,
> +				      struct svc_rqst *rqstp)
> +{
> +	smp_mb__before_atomic();
> +	clear_bit(rqstp->rq_thread_id, pool->sp_idle_map);
> +	smp_mb__after_atomic();
> +}
> +
>  static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  {
>  	struct svc_pool		*pool = rqstp->rq_pool;
> @@ -756,18 +775,17 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  	set_current_state(TASK_INTERRUPTIBLE);
>  	smp_mb__before_atomic();
>  	clear_bit(SP_CONGESTED, &pool->sp_flags);
> -	clear_bit(RQ_BUSY, &rqstp->rq_flags);
> -	smp_mb__after_atomic();
>  
> -	if (likely(rqst_should_sleep(rqstp)))
> +	if (likely(rqst_should_sleep(rqstp))) {
> +		svc_pool_thread_mark_idle(pool, rqstp);
>  		time_left = schedule_timeout(timeout);
> -	else
> +	} else
>  		__set_current_state(TASK_RUNNING);
>  
>  	try_to_freeze();
>  
> -	set_bit(RQ_BUSY, &rqstp->rq_flags);
> -	smp_mb__after_atomic();
> +	svc_pool_thread_mark_busy(pool, rqstp);
> +
>  	rqstp->rq_xprt = svc_xprt_dequeue(pool);
>  	if (rqstp->rq_xprt) {
>  		trace_svc_pool_awoken(rqstp);
> 
> 
>
Chuck Lever July 4, 2023, 2:02 a.m. UTC | #2
On Tue, Jul 04, 2023 at 11:26:22AM +1000, NeilBrown wrote:
> On Tue, 04 Jul 2023, Chuck Lever wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> > 
> > I've noticed that client-observed server request latency goes up
> > simply when the nfsd thread count is increased.
> > 
> > List walking is known to be memory-inefficient. On a busy server
> > with many threads, enqueuing a transport will walk the "all threads"
> > list quite frequently. This also pulls in the cache lines for some
> > hot fields in each svc_rqst (namely, rq_flags).
> 
> I think this text could usefully be re-written.  By this point in the
> series we aren't list walking.
> 
> I'd also be curious to know what latency difference you get for just this
> change.

Not much of a latency difference at lower thread counts.

The difference I notice is that with the spinlock version of
pool_wake_idle_thread, there is significant lock contention as
the thread count increases, and the throughput result of my fio
test is lower (outside the result variance).
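
(For reference, a purely hypothetical sketch of what such a lock-protected
search could look like -- this is not code from any posted version, and it
borrows pool->sp_lock here only for illustration:)

    static struct svc_rqst *wake_idle_thread_locked(struct svc_pool *pool)
    {
            struct svc_rqst *rqstp = NULL;
            unsigned long bit;

            /* Every waker serialises on the same pool-wide lock. */
            spin_lock_bh(&pool->sp_lock);
            bit = find_first_bit(pool->sp_idle_map, pool->sp_nrthreads);
            if (bit < pool->sp_nrthreads) {
                    clear_bit(bit, pool->sp_idle_map);
                    rqstp = xa_load(&pool->sp_thread_xa, bit);
            }
            spin_unlock_bh(&pool->sp_lock);

            if (rqstp) {
                    WRITE_ONCE(rqstp->rq_qtime, ktime_get());
                    wake_up_process(rqstp->rq_task);
            }
            return rqstp;
    }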


> > The svc_xprt_enqueue() call that concerns me most is the one in
> > svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
> > down completion handling limits the total throughput per RDMA
> > connection.
> > 
> > So, avoid walking the "all threads" list to find an idle thread to
> > wake. Instead, set up an idle bitmap and use find_next_bit, which
> > should work the same way as RQ_BUSY but it will touch only the
> > cachelines that the bitmap is in. Stick with atomic bit operations
> > to avoid taking the pool lock.
> > 
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> >  include/linux/sunrpc/svc.h    |    6 ++++--
> >  include/trace/events/sunrpc.h |    1 -
> >  net/sunrpc/svc.c              |   27 +++++++++++++++++++++------
> >  net/sunrpc/svc_xprt.c         |   30 ++++++++++++++++++++++++------
> >  4 files changed, 49 insertions(+), 15 deletions(-)
> > 
> > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > index 6f8bfcd44250..27ffcf7371d0 100644
> > --- a/include/linux/sunrpc/svc.h
> > +++ b/include/linux/sunrpc/svc.h
> > @@ -35,6 +35,7 @@ struct svc_pool {
> >  	spinlock_t		sp_lock;	/* protects sp_sockets */
> >  	struct list_head	sp_sockets;	/* pending sockets */
> >  	unsigned int		sp_nrthreads;	/* # of threads in pool */
> > +	unsigned long		*sp_idle_map;	/* idle threads */
> >  	struct xarray		sp_thread_xa;
> >  
> >  	/* statistics on pool operation */
> > @@ -190,6 +191,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> >  #define RPCSVC_MAXPAGES		((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
> >  				+ 2 + 1)
> >  
> > +#define RPCSVC_MAXPOOLTHREADS	(4096)
> > +
> >  /*
> >   * The context of a single thread, including the request currently being
> >   * processed.
> > @@ -239,8 +242,7 @@ struct svc_rqst {
> >  #define	RQ_SPLICE_OK	(4)			/* turned off in gss privacy
> >  						 * to prevent encrypting page
> >  						 * cache pages */
> > -#define	RQ_BUSY		(5)			/* request is busy */
> > -#define	RQ_DATA		(6)			/* request has data */
> > +#define	RQ_DATA		(5)			/* request has data */
> 
> Might this be a good opportunity to convert this to an enum ??
> 
> >  	unsigned long		rq_flags;	/* flags field */
> >  	u32			rq_thread_id;	/* xarray index */
> >  	ktime_t			rq_qtime;	/* enqueue time */
> > diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> > index ea43c6059bdb..c07824a254bf 100644
> > --- a/include/trace/events/sunrpc.h
> > +++ b/include/trace/events/sunrpc.h
> > @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> >  	svc_rqst_flag(USEDEFERRAL)					\
> >  	svc_rqst_flag(DROPME)						\
> >  	svc_rqst_flag(SPLICE_OK)					\
> > -	svc_rqst_flag(BUSY)						\
> >  	svc_rqst_flag_end(DATA)
> >  
> >  #undef svc_rqst_flag
> > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > index ef350f0d8925..d0278e5190ba 100644
> > --- a/net/sunrpc/svc.c
> > +++ b/net/sunrpc/svc.c
> > @@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> >  		INIT_LIST_HEAD(&pool->sp_sockets);
> >  		spin_lock_init(&pool->sp_lock);
> >  		xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> > +		/* All threads initially marked "busy" */
> > +		pool->sp_idle_map =
> > +			bitmap_zalloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
> > +					   svc_pool_map_get_node(i));
> > +		if (!pool->sp_idle_map)
> > +			return NULL;
> >  
> >  		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> >  		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> > @@ -596,6 +602,8 @@ svc_destroy(struct kref *ref)
> >  		percpu_counter_destroy(&pool->sp_threads_starved);
> >  
> >  		xa_destroy(&pool->sp_thread_xa);
> > +		bitmap_free(pool->sp_idle_map);
> > +		pool->sp_idle_map = NULL;
> >  	}
> >  	kfree(serv->sv_pools);
> >  	kfree(serv);
> > @@ -647,7 +655,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
> >  
> >  	folio_batch_init(&rqstp->rq_fbatch);
> >  
> > -	__set_bit(RQ_BUSY, &rqstp->rq_flags);
> >  	rqstp->rq_server = serv;
> >  	rqstp->rq_pool = pool;
> >  
> > @@ -677,7 +684,7 @@ static struct svc_rqst *
> >  svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> >  {
> >  	static const struct xa_limit limit = {
> > -		.max = U32_MAX,
> > +		.max = RPCSVC_MAXPOOLTHREADS,
> >  	};
> >  	struct svc_rqst	*rqstp;
> >  	int ret;
> > @@ -722,12 +729,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> >  					   struct svc_pool *pool)
> >  {
> >  	struct svc_rqst	*rqstp;
> > -	unsigned long index;
> > +	unsigned long bit;
> >  
> > -	xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> > -		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> > +	/* Check the pool's idle bitmap locklessly so that multiple
> > +	 * idle searches can proceed concurrently.
> > +	 */
> > +	for_each_set_bit(bit, pool->sp_idle_map, pool->sp_nrthreads) {
> > +		if (!test_and_clear_bit(bit, pool->sp_idle_map))
> >  			continue;
> 
> I would really rather the map was "sp_busy_map". (initialised with bitmap_fill())
> Then you could "test_and_set_bit_lock()" and later "clear_bit_unlock()"
> and so get all the required memory barriers.
> What we are doing here is locking a particular thread for a task, so
> "lock" is an appropriate description of what is happening.
> See also svc_pool_thread_mark_* below.
> 
> >  
> > +		rqstp = xa_load(&pool->sp_thread_xa, bit);
> > +		if (!rqstp)
> > +			break;
> > +
> >  		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> >  		wake_up_process(rqstp->rq_task);
> >  		percpu_counter_inc(&pool->sp_threads_woken);
> > @@ -767,7 +781,8 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
> >  	}
> >  
> >  found_pool:
> > -	rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
> > +	rqstp = xa_find(&pool->sp_thread_xa, &zero, RPCSVC_MAXPOOLTHREADS,
> > +			XA_PRESENT);
> >  	if (rqstp) {
> >  		__xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> >  		task = rqstp->rq_task;
> > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > index 7709120b45c1..2844b32c16ea 100644
> > --- a/net/sunrpc/svc_xprt.c
> > +++ b/net/sunrpc/svc_xprt.c
> > @@ -735,6 +735,25 @@ rqst_should_sleep(struct svc_rqst *rqstp)
> >  	return true;
> >  }
> >  
> > +static void svc_pool_thread_mark_idle(struct svc_pool *pool,
> > +				      struct svc_rqst *rqstp)
> > +{
> > +	smp_mb__before_atomic();
> > +	set_bit(rqstp->rq_thread_id, pool->sp_idle_map);
> > +	smp_mb__after_atomic();
> > +}
> 
> > The memory barriers above and below bother me.  There is no comment
> telling me what they are protecting against.
> I would rather svc_pool_thread_mark_idle - which unlocks the thread -
> were
> 
>     clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
> 
> and  that svc_pool_thread_mark_busy were
> 
>    test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);
> 
> Then it would be more obvious what was happening.

Not obvious to me, but that's very likely because I'm not clear what
clear_bit_unlock() does. :-)

I'll try this change for the next version of the series.


> Thanks,
> NeilBrown
> 
> > +
> > +/*
> > + * Note: If we were awoken, then this rqstp has already been marked busy.
> > + */
> > +static void svc_pool_thread_mark_busy(struct svc_pool *pool,
> > +				      struct svc_rqst *rqstp)
> > +{
> > +	smp_mb__before_atomic();
> > +	clear_bit(rqstp->rq_thread_id, pool->sp_idle_map);
> > +	smp_mb__after_atomic();
> > +}
> > +
> >  static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> >  {
> >  	struct svc_pool		*pool = rqstp->rq_pool;
> > @@ -756,18 +775,17 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> >  	set_current_state(TASK_INTERRUPTIBLE);
> >  	smp_mb__before_atomic();
> >  	clear_bit(SP_CONGESTED, &pool->sp_flags);
> > -	clear_bit(RQ_BUSY, &rqstp->rq_flags);
> > -	smp_mb__after_atomic();
> >  
> > -	if (likely(rqst_should_sleep(rqstp)))
> > +	if (likely(rqst_should_sleep(rqstp))) {
> > +		svc_pool_thread_mark_idle(pool, rqstp);
> >  		time_left = schedule_timeout(timeout);
> > -	else
> > +	} else
> >  		__set_current_state(TASK_RUNNING);
> >  
> >  	try_to_freeze();
> >  
> > -	set_bit(RQ_BUSY, &rqstp->rq_flags);
> > -	smp_mb__after_atomic();
> > +	svc_pool_thread_mark_busy(pool, rqstp);
> > +
> >  	rqstp->rq_xprt = svc_xprt_dequeue(pool);
> >  	if (rqstp->rq_xprt) {
> >  		trace_svc_pool_awoken(rqstp);
> > 
> > 
> > 
>
NeilBrown July 4, 2023, 2:17 a.m. UTC | #3
On Tue, 04 Jul 2023, Chuck Lever wrote:
> On Tue, Jul 04, 2023 at 11:26:22AM +1000, NeilBrown wrote:
> > The memory barriers above and below bother me.  There is no comment
> > telling me what they are protecting against.
> > I would rather svc_pool_thread_mark_idle - which unlocks the thread -
> > were
> > 
> >     clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
> > 
> > and  that svc_pool_thread_mark_busy were
> > 
> >    test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);
> > 
> > Then it would be more obvious what was happening.
> 
> Not obvious to me, but that's very likely because I'm not clear what
> clear_bit_unlock() does. :-)

In general, any "lock" operation (mutex, spin, whatever) is (and must
be) and "acquire" type operations which imposes a memory barrier so that
read requests *after* the lock cannot be satisfied with data from
*before* the lock.  The read must access data after the lock.
Conversely, any "unlock" operation is a "release" type operation which
imposes a memory barrier so that any write request *before* the unlock
must not be delayed until *after* the unlock.  The write must complete
before the unlock.

This is exactly what you would expect of locking - it creates a closed
code region that is properly ordered w.r.t comparable closed regions.

test_and_set_bit_lock() and clear_bit_unlock() provide these expected
semantics for bit operations.
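
As a toy example of that pairing, unrelated to this series (kernel-style C,
a sketch only):

    #include <linux/bitops.h>
    #include <linux/types.h>

    static DECLARE_BITMAP(tiny_lock, 1);    /* a single bit acting as a lock */
    static unsigned long shared_count;

    static bool try_bump_counter(void)
    {
            /* "lock": acquire -- the accesses below cannot move above this */
            if (test_and_set_bit_lock(0, tiny_lock))
                    return false;           /* already held; caller retries */

            shared_count++;                 /* safely inside the closed region */

            /* "unlock": release -- the update above is ordered before this */
            clear_bit_unlock(0, tiny_lock);
            return true;
    }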

New code should (almost?) never have explicit memory barriers like
smp_mb__after_atomic().
It should use one of the many APIs with _acquire or _release suffixes,
or with the more explicit _lock or _unlock.
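
(For instance, a publish/consume pair written with the suffixed primitives
rather than open-coded barriers -- again only an illustration, unrelated to
this patch:)

    #include <linux/compiler.h>
    #include <asm/barrier.h>

    struct message {
            int payload;
            int ready;
    };

    static struct message msg;

    static void publish(int value)
    {
            WRITE_ONCE(msg.payload, value);
            /* release: the payload store is ordered before the flag */
            smp_store_release(&msg.ready, 1);
    }

    static int try_consume(int *out)
    {
            /* acquire: pairs with the release in publish() */
            if (!smp_load_acquire(&msg.ready))
                    return 0;
            *out = READ_ONCE(msg.payload);
            return 1;
    }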

> 
> I'll try this change for the next version of the series.
> 

Thanks.

NeilBrown
Chuck Lever July 4, 2023, 3:14 p.m. UTC | #4
> On Jul 3, 2023, at 10:17 PM, NeilBrown <neilb@suse.de> wrote:
> 
> 
> In general, any "lock" operation (mutex, spin, whatever) is (and must
> be) an "acquire" type operation, which imposes a memory barrier so that
> read requests *after* the lock cannot be satisfied with data from
> *before* the lock.  The read must access data after the lock.
> Conversely, any "unlock" operation is a "release" type operation which
> imposes a memory barrier so that any write request *before* the unlock
> must not be delayed until *after* the unlock.  The write must complete
> before the unlock.
> 
> This is exactly what you would expect of locking - it creates a closed
> code region that is properly ordered w.r.t comparable closed regions.
> 
> test_and_set_bit_lock() and clear_bit_unlock() provide these expected
> semantics for bit operations.

Your explanation is more clear than what I read in Documentation/atomic*
so thanks. I feel a little more armed to make good use of it.


> New code should (almost?) never have explicit memory barriers like
> smp_mb__after_atomic().
> It should use one of the many APIs with _acquire or _release suffixes,
> or with the more explicit _lock or _unlock.

Out of curiosity, is "should never have explicit memory barriers"
documented somewhere? I've been accused of skimming when I read, so
I might have missed it.


--
Chuck Lever
Chuck Lever July 4, 2023, 4:03 p.m. UTC | #5
> On Jul 3, 2023, at 10:02 PM, Chuck Lever <cel@kernel.org> wrote:
> 
> On Tue, Jul 04, 2023 at 11:26:22AM +1000, NeilBrown wrote:
>> On Tue, 04 Jul 2023, Chuck Lever wrote:
>>> From: Chuck Lever <chuck.lever@oracle.com>
>>> 
>>> I've noticed that client-observed server request latency goes up
>>> simply when the nfsd thread count is increased.
>>> 
>>> List walking is known to be memory-inefficient. On a busy server
>>> with many threads, enqueuing a transport will walk the "all threads"
>>> list quite frequently. This also pulls in the cache lines for some
>>> hot fields in each svc_rqst (namely, rq_flags).
>> 
>> I think this text could usefully be re-written.  By this point in the
>> series we aren't list walking.
>> 
>> I'd also be curious to know what latency difference you get for just this
>> change.
> 
> Not much of a latency difference at lower thread counts.
> 
> The difference I notice is that with the spinlock version of
> pool_wake_idle_thread, there is significant lock contention as
> the thread count increases, and the throughput result of my fio
> test is lower (outside the result variance).

I mis-spoke. When I wrote this yesterday I had compared only the
"xarray with bitmap" and the "xarray with spinlock" mechanisms.
I had not tried "xarray only".

Today, while testing review-related fixes, I benchmarked "xarray
only". It behaves like the linked-list implementation it replaces:
performance degrades with anything more than a couple dozen threads
in the pool.


--
Chuck Lever
NeilBrown July 5, 2023, 12:34 a.m. UTC | #6
On Wed, 05 Jul 2023, Chuck Lever III wrote:
> 
> Your explanation is more clear than what I read in Documentation/atomic*
> so thanks. I feel a little more armed to make good use of it.
> 
> 
> > New code should (almost?) never have explicit memory barriers like
> > smp_mb__after_atomic().
> > It should use one of the many APIs with _acquire or _release suffixes,
> > or with the more explicit _lock or _unlock.
> 
> Out of curiosity, is "should never have explicit memory barriers"
> documented somewhere? I've been accused of skimming when I read, so
> I might have missed it.

My wife says I only read every second word of emails :-)

I don't know that it is documented anywhere (maybe I should submit a
patch).  The statement was really my personal rule that seems to be the
natural sequel to the introduction of the many _acquire and _release
interfaces.

NeilBrown
NeilBrown July 5, 2023, 12:35 a.m. UTC | #7
On Wed, 05 Jul 2023, Chuck Lever III wrote:
> > On Jul 3, 2023, at 10:02 PM, Chuck Lever <cel@kernel.org> wrote:
> > 
> > On Tue, Jul 04, 2023 at 11:26:22AM +1000, NeilBrown wrote:
> >> On Tue, 04 Jul 2023, Chuck Lever wrote:
> >>> From: Chuck Lever <chuck.lever@oracle.com>
> >>> 
> >>> I've noticed that client-observed server request latency goes up
> >>> simply when the nfsd thread count is increased.
> >>> 
> >>> List walking is known to be memory-inefficient. On a busy server
> >>> with many threads, enqueuing a transport will walk the "all threads"
> >>> list quite frequently. This also pulls in the cache lines for some
> >>> hot fields in each svc_rqst (namely, rq_flags).
> >> 
> >> I think this text could usefully be re-written.  By this point in the
> >> series we aren't list walking.
> >> 
> >> I'd also be curious to know what latency difference you get for just this
> >> change.
> > 
> > Not much of a latency difference at lower thread counts.
> > 
> > The difference I notice is that with the spinlock version of
> > pool_wake_idle_thread, there is significant lock contention as
> > the thread count increases, and the throughput result of my fio
> > test is lower (outside the result variance).
> 
> I mis-spoke. When I wrote this yesterday I had compared only the
> "xarray with bitmap" and the "xarray with spinlock" mechanisms.
> I had not tried "xarray only".
> 
> Today, while testing review-related fixes, I benchmarked "xarray
> only". It behaves like the linked-list implementation it replaces:
> performance degrades with anything more than a couple dozen threads
> in the pool.

I'm a little surprised it is that bad, but only a little.
The above is good text to include in the justification of that last
patch.

Thanks for the clarification.
NeilBrown

Patch

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 6f8bfcd44250..27ffcf7371d0 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -35,6 +35,7 @@  struct svc_pool {
 	spinlock_t		sp_lock;	/* protects sp_sockets */
 	struct list_head	sp_sockets;	/* pending sockets */
 	unsigned int		sp_nrthreads;	/* # of threads in pool */
+	unsigned long		*sp_idle_map;	/* idle threads */
 	struct xarray		sp_thread_xa;
 
 	/* statistics on pool operation */
@@ -190,6 +191,8 @@  extern u32 svc_max_payload(const struct svc_rqst *rqstp);
 #define RPCSVC_MAXPAGES		((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
 				+ 2 + 1)
 
+#define RPCSVC_MAXPOOLTHREADS	(4096)
+
 /*
  * The context of a single thread, including the request currently being
  * processed.
@@ -239,8 +242,7 @@  struct svc_rqst {
 #define	RQ_SPLICE_OK	(4)			/* turned off in gss privacy
 						 * to prevent encrypting page
 						 * cache pages */
-#define	RQ_BUSY		(5)			/* request is busy */
-#define	RQ_DATA		(6)			/* request has data */
+#define	RQ_DATA		(5)			/* request has data */
 	unsigned long		rq_flags;	/* flags field */
 	u32			rq_thread_id;	/* xarray index */
 	ktime_t			rq_qtime;	/* enqueue time */
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index ea43c6059bdb..c07824a254bf 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -1676,7 +1676,6 @@  DEFINE_SVCXDRBUF_EVENT(sendto);
 	svc_rqst_flag(USEDEFERRAL)					\
 	svc_rqst_flag(DROPME)						\
 	svc_rqst_flag(SPLICE_OK)					\
-	svc_rqst_flag(BUSY)						\
 	svc_rqst_flag_end(DATA)
 
 #undef svc_rqst_flag
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index ef350f0d8925..d0278e5190ba 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -509,6 +509,12 @@  __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
 		INIT_LIST_HEAD(&pool->sp_sockets);
 		spin_lock_init(&pool->sp_lock);
 		xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
+		/* All threads initially marked "busy" */
+		pool->sp_idle_map =
+			bitmap_zalloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
+					   svc_pool_map_get_node(i));
+		if (!pool->sp_idle_map)
+			return NULL;
 
 		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
 		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
@@ -596,6 +602,8 @@  svc_destroy(struct kref *ref)
 		percpu_counter_destroy(&pool->sp_threads_starved);
 
 		xa_destroy(&pool->sp_thread_xa);
+		bitmap_free(pool->sp_idle_map);
+		pool->sp_idle_map = NULL;
 	}
 	kfree(serv->sv_pools);
 	kfree(serv);
@@ -647,7 +655,6 @@  svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
 
 	folio_batch_init(&rqstp->rq_fbatch);
 
-	__set_bit(RQ_BUSY, &rqstp->rq_flags);
 	rqstp->rq_server = serv;
 	rqstp->rq_pool = pool;
 
@@ -677,7 +684,7 @@  static struct svc_rqst *
 svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
 {
 	static const struct xa_limit limit = {
-		.max = U32_MAX,
+		.max = RPCSVC_MAXPOOLTHREADS,
 	};
 	struct svc_rqst	*rqstp;
 	int ret;
@@ -722,12 +729,19 @@  struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
 					   struct svc_pool *pool)
 {
 	struct svc_rqst	*rqstp;
-	unsigned long index;
+	unsigned long bit;
 
-	xa_for_each(&pool->sp_thread_xa, index, rqstp) {
-		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
+	/* Check the pool's idle bitmap locklessly so that multiple
+	 * idle searches can proceed concurrently.
+	 */
+	for_each_set_bit(bit, pool->sp_idle_map, pool->sp_nrthreads) {
+		if (!test_and_clear_bit(bit, pool->sp_idle_map))
 			continue;
 
+		rqstp = xa_load(&pool->sp_thread_xa, bit);
+		if (!rqstp)
+			break;
+
 		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
 		wake_up_process(rqstp->rq_task);
 		percpu_counter_inc(&pool->sp_threads_woken);
@@ -767,7 +781,8 @@  svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
 	}
 
 found_pool:
-	rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
+	rqstp = xa_find(&pool->sp_thread_xa, &zero, RPCSVC_MAXPOOLTHREADS,
+			XA_PRESENT);
 	if (rqstp) {
 		__xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
 		task = rqstp->rq_task;
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 7709120b45c1..2844b32c16ea 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -735,6 +735,25 @@  rqst_should_sleep(struct svc_rqst *rqstp)
 	return true;
 }
 
+static void svc_pool_thread_mark_idle(struct svc_pool *pool,
+				      struct svc_rqst *rqstp)
+{
+	smp_mb__before_atomic();
+	set_bit(rqstp->rq_thread_id, pool->sp_idle_map);
+	smp_mb__after_atomic();
+}
+
+/*
+ * Note: If we were awoken, then this rqstp has already been marked busy.
+ */
+static void svc_pool_thread_mark_busy(struct svc_pool *pool,
+				      struct svc_rqst *rqstp)
+{
+	smp_mb__before_atomic();
+	clear_bit(rqstp->rq_thread_id, pool->sp_idle_map);
+	smp_mb__after_atomic();
+}
+
 static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 {
 	struct svc_pool		*pool = rqstp->rq_pool;
@@ -756,18 +775,17 @@  static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 	set_current_state(TASK_INTERRUPTIBLE);
 	smp_mb__before_atomic();
 	clear_bit(SP_CONGESTED, &pool->sp_flags);
-	clear_bit(RQ_BUSY, &rqstp->rq_flags);
-	smp_mb__after_atomic();
 
-	if (likely(rqst_should_sleep(rqstp)))
+	if (likely(rqst_should_sleep(rqstp))) {
+		svc_pool_thread_mark_idle(pool, rqstp);
 		time_left = schedule_timeout(timeout);
-	else
+	} else
 		__set_current_state(TASK_RUNNING);
 
 	try_to_freeze();
 
-	set_bit(RQ_BUSY, &rqstp->rq_flags);
-	smp_mb__after_atomic();
+	svc_pool_thread_mark_busy(pool, rqstp);
+
 	rqstp->rq_xprt = svc_xprt_dequeue(pool);
 	if (rqstp->rq_xprt) {
 		trace_svc_pool_awoken(rqstp);