diff mbox

[3/4] sunrpc: convert to lockless lookup of queued server threads

Message ID 1416597571-4265-4-git-send-email-jlayton@primarydata.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jeff Layton Nov. 21, 2014, 7:19 p.m. UTC
Testing has shown that the pool->sp_lock can be a bottleneck on a busy
server. Every time data is received on a socket, the server must take
that lock in order to dequeue a thread from the sp_threads list.

Address this problem by eliminating the sp_threads list (which contains
threads that are currently idle) and replacing it with a RQ_BUSY flag in
svc_rqst. This allows us to walk the sp_all_threads list under the
rcu_read_lock and find a suitable thread for the xprt by doing a
test_and_set_bit.

Note that we do still have a potential atomicity problem however with
this approach.  We don't want svc_xprt_do_enqueue to set the
rqst->rq_xprt pointer unless a test_and_set_bit of RQ_BUSY returned
negative (which indicates that the thread was idle). But, by the time we
check that, the big could be flipped by a waking thread.

To address this, we acquire a new per-rqst spinlock (rq_lock) and take
that before doing the test_and_set_bit. If that returns false, then we
can set rq_xprt and drop the spinlock. Then, when the thread wakes up,
it must set the bit under the same spinlock and can trust that if it was
already set then the rq_xprt is also properly set.

With this scheme, the case where we have an idle thread no longer needs
to take the highly contended pool->sp_lock at all, and that removes the
bottleneck.

That still leaves one issue: What of the case where we walk the whole
sp_all_threads list and don't find an idle thread? Because the search is
lockess, it's possible for the queueing to race with a thread that is
going to sleep. To address that, we queue the xprt and then search again.

If we find an idle thread at that point, we can't attach the xprt to it
directly since that might race with a different thread waking up and
finding it.  All we can do is wake the idle thread back up and let it
attempt to find the now-queued xprt.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Chris Worley <chris.worley@primarydata.com>
---
 include/linux/sunrpc/svc.h    |   4 +-
 include/trace/events/sunrpc.h |   3 +-
 net/sunrpc/svc.c              |   7 +-
 net/sunrpc/svc_xprt.c         | 221 ++++++++++++++++++++++++------------------
 4 files changed, 132 insertions(+), 103 deletions(-)

Comments

J. Bruce Fields Dec. 1, 2014, 11:47 p.m. UTC | #1
On Fri, Nov 21, 2014 at 02:19:30PM -0500, Jeff Layton wrote:
> Testing has shown that the pool->sp_lock can be a bottleneck on a busy
> server. Every time data is received on a socket, the server must take
> that lock in order to dequeue a thread from the sp_threads list.
> 
> Address this problem by eliminating the sp_threads list (which contains
> threads that are currently idle) and replacing it with a RQ_BUSY flag in
> svc_rqst. This allows us to walk the sp_all_threads list under the
> rcu_read_lock and find a suitable thread for the xprt by doing a
> test_and_set_bit.
> 
> Note that we do still have a potential atomicity problem however with
> this approach.  We don't want svc_xprt_do_enqueue to set the
> rqst->rq_xprt pointer unless a test_and_set_bit of RQ_BUSY returned
> negative (which indicates that the thread was idle). But, by the time we
> check that, the big could be flipped by a waking thread.

(Nits: replacing "negative" by "zero" and "big" by "bit".)

> To address this, we acquire a new per-rqst spinlock (rq_lock) and take
> that before doing the test_and_set_bit. If that returns false, then we
> can set rq_xprt and drop the spinlock. Then, when the thread wakes up,
> it must set the bit under the same spinlock and can trust that if it was
> already set then the rq_xprt is also properly set.
> 
> With this scheme, the case where we have an idle thread no longer needs
> to take the highly contended pool->sp_lock at all, and that removes the
> bottleneck.
> 
> That still leaves one issue: What of the case where we walk the whole
> sp_all_threads list and don't find an idle thread? Because the search is
> lockess, it's possible for the queueing to race with a thread that is
> going to sleep. To address that, we queue the xprt and then search again.
> 
> If we find an idle thread at that point, we can't attach the xprt to it
> directly since that might race with a different thread waking up and
> finding it.  All we can do is wake the idle thread back up and let it
> attempt to find the now-queued xprt.

I find it hard to think about how we expect this to affect performance.
So it comes down to the observed results, I guess, but just trying to
get an idea:

	- this eliminates sp_lock.  I think the original idea here was
	  that if interrupts could be routed correctly then there
	  shouldn't normally be cross-cpu contention on this lock.  Do
	  we understand why that didn't pan out?  Is hardware capable of
	  doing this really rare, or is it just too hard to configure it
	  correctly?

	- instead we're walking the list of all threads looking for an
	  idle one.  I suppose that's tpyically not more than a few
	  hundred.  Does this being fast depend on the fact that that
	  list is almost never changed?  Should we be rearranging
	  svc_rqst so frequently-written fields aren't nearby?

--b.

> 
> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
> Tested-by: Chris Worley <chris.worley@primarydata.com>
> ---
>  include/linux/sunrpc/svc.h    |   4 +-
>  include/trace/events/sunrpc.h |   3 +-
>  net/sunrpc/svc.c              |   7 +-
>  net/sunrpc/svc_xprt.c         | 221 ++++++++++++++++++++++++------------------
>  4 files changed, 132 insertions(+), 103 deletions(-)
> 
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 513957eba0a5..6f22cfeef5e3 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -45,7 +45,6 @@ struct svc_pool_stats {
>  struct svc_pool {
>  	unsigned int		sp_id;	    	/* pool id; also node id on NUMA */
>  	spinlock_t		sp_lock;	/* protects all fields */
> -	struct list_head	sp_threads;	/* idle server threads */
>  	struct list_head	sp_sockets;	/* pending sockets */
>  	unsigned int		sp_nrthreads;	/* # of threads in pool */
>  	struct list_head	sp_all_threads;	/* all server threads */
> @@ -221,7 +220,6 @@ static inline void svc_putu32(struct kvec *iov, __be32 val)
>   * processed.
>   */
>  struct svc_rqst {
> -	struct list_head	rq_list;	/* idle list */
>  	struct list_head	rq_all;		/* all threads list */
>  	struct rcu_head		rq_rcu_head;	/* for RCU deferred kfree */
>  	struct svc_xprt *	rq_xprt;	/* transport ptr */
> @@ -264,6 +262,7 @@ struct svc_rqst {
>  						 * to prevent encrypting page
>  						 * cache pages */
>  #define	RQ_VICTIM	(5)			/* about to be shut down */
> +#define	RQ_BUSY		(6)			/* request is busy */
>  	unsigned long		rq_flags;	/* flags field */
>  
>  	void *			rq_argp;	/* decoded arguments */
> @@ -285,6 +284,7 @@ struct svc_rqst {
>  	struct auth_domain *	rq_gssclient;	/* "gss/"-style peer info */
>  	struct svc_cacherep *	rq_cacherep;	/* cache info */
>  	struct task_struct	*rq_task;	/* service thread */
> +	spinlock_t		rq_lock;	/* per-request lock */
>  };
>  
>  #define SVC_NET(svc_rqst)	(svc_rqst->rq_xprt->xpt_net)
> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> index 08a5fed50f34..ee4438a63a48 100644
> --- a/include/trace/events/sunrpc.h
> +++ b/include/trace/events/sunrpc.h
> @@ -419,7 +419,8 @@ TRACE_EVENT(xs_tcp_data_recv,
>  		{ (1UL << RQ_USEDEFERRAL),	"RQ_USEDEFERRAL"},	\
>  		{ (1UL << RQ_DROPME),		"RQ_DROPME"},		\
>  		{ (1UL << RQ_SPLICE_OK),	"RQ_SPLICE_OK"},	\
> -		{ (1UL << RQ_VICTIM),		"RQ_VICTIM"})
> +		{ (1UL << RQ_VICTIM),		"RQ_VICTIM"},		\
> +		{ (1UL << RQ_BUSY),		"RQ_BUSY"})
>  
>  TRACE_EVENT(svc_recv,
>  	TP_PROTO(struct svc_rqst *rqst, int status),
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 4edef32f3b9f..4308881d9d0a 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -476,7 +476,6 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
>  				i, serv->sv_name);
>  
>  		pool->sp_id = i;
> -		INIT_LIST_HEAD(&pool->sp_threads);
>  		INIT_LIST_HEAD(&pool->sp_sockets);
>  		INIT_LIST_HEAD(&pool->sp_all_threads);
>  		spin_lock_init(&pool->sp_lock);
> @@ -614,12 +613,14 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>  		goto out_enomem;
>  
>  	serv->sv_nrthreads++;
> +	__set_bit(RQ_BUSY, &rqstp->rq_flags);
> +	spin_lock_init(&rqstp->rq_lock);
> +	rqstp->rq_server = serv;
> +	rqstp->rq_pool = pool;
>  	spin_lock_bh(&pool->sp_lock);
>  	pool->sp_nrthreads++;
>  	list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
>  	spin_unlock_bh(&pool->sp_lock);
> -	rqstp->rq_server = serv;
> -	rqstp->rq_pool = pool;
>  
>  	rqstp->rq_argp = kmalloc_node(serv->sv_xdrsize, GFP_KERNEL, node);
>  	if (!rqstp->rq_argp)
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 579ff2249562..ed90d955f733 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -310,25 +310,6 @@ char *svc_print_addr(struct svc_rqst *rqstp, char *buf, size_t len)
>  }
>  EXPORT_SYMBOL_GPL(svc_print_addr);
>  
> -/*
> - * Queue up an idle server thread.  Must have pool->sp_lock held.
> - * Note: this is really a stack rather than a queue, so that we only
> - * use as many different threads as we need, and the rest don't pollute
> - * the cache.
> - */
> -static void svc_thread_enqueue(struct svc_pool *pool, struct svc_rqst *rqstp)
> -{
> -	list_add(&rqstp->rq_list, &pool->sp_threads);
> -}
> -
> -/*
> - * Dequeue an nfsd thread.  Must have pool->sp_lock held.
> - */
> -static void svc_thread_dequeue(struct svc_pool *pool, struct svc_rqst *rqstp)
> -{
> -	list_del(&rqstp->rq_list);
> -}
> -
>  static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
>  {
>  	if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
> @@ -343,6 +324,7 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
>  	struct svc_pool *pool;
>  	struct svc_rqst	*rqstp;
>  	int cpu;
> +	bool queued = false;
>  
>  	if (!svc_xprt_has_something_to_do(xprt))
>  		return;
> @@ -360,37 +342,60 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
>  
>  	cpu = get_cpu();
>  	pool = svc_pool_for_cpu(xprt->xpt_server, cpu);
> -	spin_lock_bh(&pool->sp_lock);
>  
>  	atomic_long_inc(&pool->sp_stats.packets);
>  
> -	if (!list_empty(&pool->sp_threads)) {
> -		rqstp = list_entry(pool->sp_threads.next,
> -				   struct svc_rqst,
> -				   rq_list);
> -		dprintk("svc: transport %p served by daemon %p\n",
> -			xprt, rqstp);
> -		svc_thread_dequeue(pool, rqstp);
> -		if (rqstp->rq_xprt)
> -			printk(KERN_ERR
> -				"svc_xprt_enqueue: server %p, rq_xprt=%p!\n",
> -				rqstp, rqstp->rq_xprt);
> -		/* Note the order of the following 3 lines:
> -		 * We want to assign xprt to rqstp->rq_xprt only _after_
> -		 * we've woken up the process, so that we don't race with
> -		 * the lockless check in svc_get_next_xprt().
> +redo_search:
> +	/* find a thread for this xprt */
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
> +		/* Do a lockless check first */
> +		if (test_bit(RQ_BUSY, &rqstp->rq_flags))
> +			continue;
> +
> +		/*
> +		 * Once the xprt has been queued, it can only be dequeued by
> +		 * the task that intends to service it. All we can do at that
> +		 * point is to try to wake this thread back up so that it can
> +		 * do so.
>  		 */
> -		svc_xprt_get(xprt);
> -		wake_up_process(rqstp->rq_task);
> -		rqstp->rq_xprt = xprt;
> +		if (!queued) {
> +			spin_lock_bh(&rqstp->rq_lock);
> +			if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags)) {
> +				/* already busy, move on... */
> +				spin_unlock_bh(&rqstp->rq_lock);
> +				continue;
> +			}
> +
> +			/* this one will do */
> +			rqstp->rq_xprt = xprt;
> +			svc_xprt_get(xprt);
> +			spin_unlock_bh(&rqstp->rq_lock);
> +		}
> +		rcu_read_unlock();
> +
>  		atomic_long_inc(&pool->sp_stats.threads_woken);
> -	} else {
> +		wake_up_process(rqstp->rq_task);
> +		put_cpu();
> +		return;
> +	}
> +	rcu_read_unlock();
> +
> +	/*
> +	 * We didn't find an idle thread to use, so we need to queue the xprt.
> +	 * Do so and then search again. If we find one, we can't hook this one
> +	 * up to it directly but we can wake the thread up in the hopes that it
> +	 * will pick it up once it searches for a xprt to service.
> +	 */
> +	if (!queued) {
> +		queued = true;
>  		dprintk("svc: transport %p put into queue\n", xprt);
> +		spin_lock_bh(&pool->sp_lock);
>  		list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
>  		pool->sp_stats.sockets_queued++;
> +		spin_unlock_bh(&pool->sp_lock);
> +		goto redo_search;
>  	}
> -
> -	spin_unlock_bh(&pool->sp_lock);
>  	put_cpu();
>  }
>  
> @@ -408,21 +413,26 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
>  EXPORT_SYMBOL_GPL(svc_xprt_enqueue);
>  
>  /*
> - * Dequeue the first transport.  Must be called with the pool->sp_lock held.
> + * Dequeue the first transport, if there is one.
>   */
>  static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
>  {
> -	struct svc_xprt	*xprt;
> +	struct svc_xprt	*xprt = NULL;
>  
>  	if (list_empty(&pool->sp_sockets))
>  		return NULL;
>  
> -	xprt = list_entry(pool->sp_sockets.next,
> -			  struct svc_xprt, xpt_ready);
> -	list_del_init(&xprt->xpt_ready);
> +	spin_lock_bh(&pool->sp_lock);
> +	if (likely(!list_empty(&pool->sp_sockets))) {
> +		xprt = list_first_entry(&pool->sp_sockets,
> +					struct svc_xprt, xpt_ready);
> +		list_del_init(&xprt->xpt_ready);
> +		svc_xprt_get(xprt);
>  
> -	dprintk("svc: transport %p dequeued, inuse=%d\n",
> -		xprt, atomic_read(&xprt->xpt_ref.refcount));
> +		dprintk("svc: transport %p dequeued, inuse=%d\n",
> +			xprt, atomic_read(&xprt->xpt_ref.refcount));
> +	}
> +	spin_unlock_bh(&pool->sp_lock);
>  
>  	return xprt;
>  }
> @@ -497,16 +507,21 @@ void svc_wake_up(struct svc_serv *serv)
>  
>  	pool = &serv->sv_pools[0];
>  
> -	spin_lock_bh(&pool->sp_lock);
> -	if (!list_empty(&pool->sp_threads)) {
> -		rqstp = list_entry(pool->sp_threads.next,
> -				   struct svc_rqst,
> -				   rq_list);
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
> +		/* skip any that aren't queued */
> +		if (test_bit(RQ_BUSY, &rqstp->rq_flags))
> +			continue;
> +		rcu_read_unlock();
>  		dprintk("svc: daemon %p woken up.\n", rqstp);
>  		wake_up_process(rqstp->rq_task);
> -	} else
> -		set_bit(SP_TASK_PENDING, &pool->sp_flags);
> -	spin_unlock_bh(&pool->sp_lock);
> +		return;
> +	}
> +	rcu_read_unlock();
> +
> +	/* No free entries available */
> +	set_bit(SP_TASK_PENDING, &pool->sp_flags);
> +	smp_wmb();
>  }
>  EXPORT_SYMBOL_GPL(svc_wake_up);
>  
> @@ -617,22 +632,47 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>  	return 0;
>  }
>  
> +static bool
> +rqst_should_sleep(struct svc_rqst *rqstp)
> +{
> +	struct svc_pool		*pool = rqstp->rq_pool;
> +
> +	/* did someone call svc_wake_up? */
> +	if (test_and_clear_bit(SP_TASK_PENDING, &pool->sp_flags))
> +		return false;
> +
> +	/* was a socket queued? */
> +	if (!list_empty(&pool->sp_sockets))
> +		return false;
> +
> +	/* are we shutting down? */
> +	if (signalled() || kthread_should_stop())
> +		return false;
> +
> +	/* are we freezing? */
> +	if (freezing(current))
> +		return false;
> +
> +	return true;
> +}
> +
>  static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  {
>  	struct svc_xprt *xprt;
>  	struct svc_pool		*pool = rqstp->rq_pool;
>  	long			time_left = 0;
>  
> +	/* rq_xprt should be clear on entry */
> +	WARN_ON_ONCE(rqstp->rq_xprt);
> +
>  	/* Normally we will wait up to 5 seconds for any required
>  	 * cache information to be provided.
>  	 */
>  	rqstp->rq_chandle.thread_wait = 5*HZ;
>  
> -	spin_lock_bh(&pool->sp_lock);
>  	xprt = svc_xprt_dequeue(pool);
>  	if (xprt) {
>  		rqstp->rq_xprt = xprt;
> -		svc_xprt_get(xprt);
>  
>  		/* As there is a shortage of threads and this request
>  		 * had to be queued, don't allow the thread to wait so
> @@ -640,51 +680,38 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  		 */
>  		rqstp->rq_chandle.thread_wait = 1*HZ;
>  		clear_bit(SP_TASK_PENDING, &pool->sp_flags);
> -	} else {
> -		if (test_and_clear_bit(SP_TASK_PENDING, &pool->sp_flags)) {
> -			xprt = ERR_PTR(-EAGAIN);
> -			goto out;
> -		}
> -		/*
> -		 * We have to be able to interrupt this wait
> -		 * to bring down the daemons ...
> -		 */
> -		set_current_state(TASK_INTERRUPTIBLE);
> +		return xprt;
> +	}
>  
> -		/* No data pending. Go to sleep */
> -		svc_thread_enqueue(pool, rqstp);
> -		spin_unlock_bh(&pool->sp_lock);
> +	/*
> +	 * We have to be able to interrupt this wait
> +	 * to bring down the daemons ...
> +	 */
> +	set_current_state(TASK_INTERRUPTIBLE);
> +	clear_bit(RQ_BUSY, &rqstp->rq_flags);
> +	smp_mb();
> +
> +	if (likely(rqst_should_sleep(rqstp)))
> +		time_left = schedule_timeout(timeout);
> +	else
> +		__set_current_state(TASK_RUNNING);
>  
> -		if (!(signalled() || kthread_should_stop())) {
> -			time_left = schedule_timeout(timeout);
> -			__set_current_state(TASK_RUNNING);
> +	try_to_freeze();
>  
> -			try_to_freeze();
> +	spin_lock_bh(&rqstp->rq_lock);
> +	set_bit(RQ_BUSY, &rqstp->rq_flags);
> +	spin_unlock_bh(&rqstp->rq_lock);
>  
> -			xprt = rqstp->rq_xprt;
> -			if (xprt != NULL)
> -				return xprt;
> -		} else
> -			__set_current_state(TASK_RUNNING);
> +	xprt = rqstp->rq_xprt;
> +	if (xprt != NULL)
> +		return xprt;
>  
> -		spin_lock_bh(&pool->sp_lock);
> -		if (!time_left)
> -			atomic_long_inc(&pool->sp_stats.threads_timedout);
> +	if (!time_left)
> +		atomic_long_inc(&pool->sp_stats.threads_timedout);
>  
> -		xprt = rqstp->rq_xprt;
> -		if (!xprt) {
> -			svc_thread_dequeue(pool, rqstp);
> -			spin_unlock_bh(&pool->sp_lock);
> -			dprintk("svc: server %p, no data yet\n", rqstp);
> -			if (signalled() || kthread_should_stop())
> -				return ERR_PTR(-EINTR);
> -			else
> -				return ERR_PTR(-EAGAIN);
> -		}
> -	}
> -out:
> -	spin_unlock_bh(&pool->sp_lock);
> -	return xprt;
> +	if (signalled() || kthread_should_stop())
> +		return ERR_PTR(-EINTR);
> +	return ERR_PTR(-EAGAIN);
>  }
>  
>  static void svc_add_new_temp_xprt(struct svc_serv *serv, struct svc_xprt *newxpt)
> -- 
> 2.1.0
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Dec. 2, 2014, 12:38 a.m. UTC | #2
On Mon, Dec 1, 2014 at 6:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Fri, Nov 21, 2014 at 02:19:30PM -0500, Jeff Layton wrote:
>> Testing has shown that the pool->sp_lock can be a bottleneck on a busy
>> server. Every time data is received on a socket, the server must take
>> that lock in order to dequeue a thread from the sp_threads list.
>>
>> Address this problem by eliminating the sp_threads list (which contains
>> threads that are currently idle) and replacing it with a RQ_BUSY flag in
>> svc_rqst. This allows us to walk the sp_all_threads list under the
>> rcu_read_lock and find a suitable thread for the xprt by doing a
>> test_and_set_bit.
>>
>> Note that we do still have a potential atomicity problem however with
>> this approach.  We don't want svc_xprt_do_enqueue to set the
>> rqst->rq_xprt pointer unless a test_and_set_bit of RQ_BUSY returned
>> negative (which indicates that the thread was idle). But, by the time we
>> check that, the big could be flipped by a waking thread.
>
> (Nits: replacing "negative" by "zero" and "big" by "bit".)
>
>> To address this, we acquire a new per-rqst spinlock (rq_lock) and take
>> that before doing the test_and_set_bit. If that returns false, then we
>> can set rq_xprt and drop the spinlock. Then, when the thread wakes up,
>> it must set the bit under the same spinlock and can trust that if it was
>> already set then the rq_xprt is also properly set.
>>
>> With this scheme, the case where we have an idle thread no longer needs
>> to take the highly contended pool->sp_lock at all, and that removes the
>> bottleneck.
>>
>> That still leaves one issue: What of the case where we walk the whole
>> sp_all_threads list and don't find an idle thread? Because the search is
>> lockess, it's possible for the queueing to race with a thread that is
>> going to sleep. To address that, we queue the xprt and then search again.
>>
>> If we find an idle thread at that point, we can't attach the xprt to it
>> directly since that might race with a different thread waking up and
>> finding it.  All we can do is wake the idle thread back up and let it
>> attempt to find the now-queued xprt.
>
> I find it hard to think about how we expect this to affect performance.
> So it comes down to the observed results, I guess, but just trying to
> get an idea:
>
>         - this eliminates sp_lock.  I think the original idea here was
>           that if interrupts could be routed correctly then there
>           shouldn't normally be cross-cpu contention on this lock.  Do
>           we understand why that didn't pan out?  Is hardware capable of
>           doing this really rare, or is it just too hard to configure it
>           correctly?

One problem is that a 1MB incoming write will generate a lot of
interrupts. While that is not so noticeable on a 1GigE network, it is
on a 40GigE network. The other thing you should note is that this
workload was generated with ~100 clients pounding on that server, so
there are a fair amount of TCP connections to service in parallel.
Playing with the interrupt routing doesn't necessarily help you so
much when all those connections are hot.

>         - instead we're walking the list of all threads looking for an
>           idle one.  I suppose that's tpyically not more than a few
>           hundred.  Does this being fast depend on the fact that that
>           list is almost never changed?  Should we be rearranging
>           svc_rqst so frequently-written fields aren't nearby?

Given a 64-byte cache line, that is 8 pointers worth on a 64-bit processor.

- rq_all, rq_server, rq_pool, rq_task don't ever change, so perhaps
shove them together into the same cacheline?

- rq_xprt does get set often until we have a full RPC request worth of
data, so perhaps consider moving that.

- OTOH, rq_addr, rq_addrlen, rq_daddr, rq_daddrlen are only set once
we have a full RPC to process, and then keep their values until that
RPC call is finished. That doesn't look too bad.

Cheers
  Trond
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Layton Dec. 2, 2014, 11:57 a.m. UTC | #3
On Mon, 1 Dec 2014 19:38:19 -0500
Trond Myklebust <trondmy@gmail.com> wrote:

> On Mon, Dec 1, 2014 at 6:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > On Fri, Nov 21, 2014 at 02:19:30PM -0500, Jeff Layton wrote:
> >> Testing has shown that the pool->sp_lock can be a bottleneck on a busy
> >> server. Every time data is received on a socket, the server must take
> >> that lock in order to dequeue a thread from the sp_threads list.
> >>
> >> Address this problem by eliminating the sp_threads list (which contains
> >> threads that are currently idle) and replacing it with a RQ_BUSY flag in
> >> svc_rqst. This allows us to walk the sp_all_threads list under the
> >> rcu_read_lock and find a suitable thread for the xprt by doing a
> >> test_and_set_bit.
> >>
> >> Note that we do still have a potential atomicity problem however with
> >> this approach.  We don't want svc_xprt_do_enqueue to set the
> >> rqst->rq_xprt pointer unless a test_and_set_bit of RQ_BUSY returned
> >> negative (which indicates that the thread was idle). But, by the time we
> >> check that, the big could be flipped by a waking thread.
> >
> > (Nits: replacing "negative" by "zero" and "big" by "bit".)
> >

> >> To address this, we acquire a new per-rqst spinlock (rq_lock) and take
> >> that before doing the test_and_set_bit. If that returns false, then we
> >> can set rq_xprt and drop the spinlock. Then, when the thread wakes up,
> >> it must set the bit under the same spinlock and can trust that if it was
> >> already set then the rq_xprt is also properly set.
> >>
> >> With this scheme, the case where we have an idle thread no longer needs
> >> to take the highly contended pool->sp_lock at all, and that removes the
> >> bottleneck.
> >>
> >> That still leaves one issue: What of the case where we walk the whole
> >> sp_all_threads list and don't find an idle thread? Because the search is
> >> lockess, it's possible for the queueing to race with a thread that is
> >> going to sleep. To address that, we queue the xprt and then search again.
> >>
> >> If we find an idle thread at that point, we can't attach the xprt to it
> >> directly since that might race with a different thread waking up and
> >> finding it.  All we can do is wake the idle thread back up and let it
> >> attempt to find the now-queued xprt.
> >
> > I find it hard to think about how we expect this to affect performance.
> > So it comes down to the observed results, I guess, but just trying to
> > get an idea:
> >
> >         - this eliminates sp_lock.  I think the original idea here was
> >           that if interrupts could be routed correctly then there
> >           shouldn't normally be cross-cpu contention on this lock.  Do
> >           we understand why that didn't pan out?  Is hardware capable of
> >           doing this really rare, or is it just too hard to configure it
> >           correctly?
> 
> One problem is that a 1MB incoming write will generate a lot of
> interrupts. While that is not so noticeable on a 1GigE network, it is
> on a 40GigE network. The other thing you should note is that this
> workload was generated with ~100 clients pounding on that server, so
> there are a fair amount of TCP connections to service in parallel.
> Playing with the interrupt routing doesn't necessarily help you so
> much when all those connections are hot.
> 
> >         - instead we're walking the list of all threads looking for an
> >           idle one.  I suppose that's tpyically not more than a few
> >           hundred.  Does this being fast depend on the fact that that
> >           list is almost never changed?  Should we be rearranging
> >           svc_rqst so frequently-written fields aren't nearby?
> 
> Given a 64-byte cache line, that is 8 pointers worth on a 64-bit processor.
> 
> - rq_all, rq_server, rq_pool, rq_task don't ever change, so perhaps
> shove them together into the same cacheline?
> 
> - rq_xprt does get set often until we have a full RPC request worth of
> data, so perhaps consider moving that.
> 
> - OTOH, rq_addr, rq_addrlen, rq_daddr, rq_daddrlen are only set once
> we have a full RPC to process, and then keep their values until that
> RPC call is finished. That doesn't look too bad.
> 
> Cheers
>   Trond
Jeff Layton Dec. 2, 2014, 12:14 p.m. UTC | #4
On Tue, 2 Dec 2014 06:57:50 -0500
Jeff Layton <jeff.layton@primarydata.com> wrote:

> On Mon, 1 Dec 2014 19:38:19 -0500
> Trond Myklebust <trondmy@gmail.com> wrote:
> 
> > On Mon, Dec 1, 2014 at 6:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > > On Fri, Nov 21, 2014 at 02:19:30PM -0500, Jeff Layton wrote:
> > >> Testing has shown that the pool->sp_lock can be a bottleneck on a busy
> > >> server. Every time data is received on a socket, the server must take
> > >> that lock in order to dequeue a thread from the sp_threads list.
> > >>
> > >> Address this problem by eliminating the sp_threads list (which contains
> > >> threads that are currently idle) and replacing it with a RQ_BUSY flag in
> > >> svc_rqst. This allows us to walk the sp_all_threads list under the
> > >> rcu_read_lock and find a suitable thread for the xprt by doing a
> > >> test_and_set_bit.
> > >>
> > >> Note that we do still have a potential atomicity problem however with
> > >> this approach.  We don't want svc_xprt_do_enqueue to set the
> > >> rqst->rq_xprt pointer unless a test_and_set_bit of RQ_BUSY returned
> > >> negative (which indicates that the thread was idle). But, by the time we
> > >> check that, the big could be flipped by a waking thread.
> > >
> > > (Nits: replacing "negative" by "zero" and "big" by "bit".)
> > >
> 

Sorry, hit send too quickly...

Thanks for fixing those.


> > >> To address this, we acquire a new per-rqst spinlock (rq_lock) and take
> > >> that before doing the test_and_set_bit. If that returns false, then we
> > >> can set rq_xprt and drop the spinlock. Then, when the thread wakes up,
> > >> it must set the bit under the same spinlock and can trust that if it was
> > >> already set then the rq_xprt is also properly set.
> > >>
> > >> With this scheme, the case where we have an idle thread no longer needs
> > >> to take the highly contended pool->sp_lock at all, and that removes the
> > >> bottleneck.
> > >>
> > >> That still leaves one issue: What of the case where we walk the whole
> > >> sp_all_threads list and don't find an idle thread? Because the search is
> > >> lockess, it's possible for the queueing to race with a thread that is
> > >> going to sleep. To address that, we queue the xprt and then search again.
> > >>
> > >> If we find an idle thread at that point, we can't attach the xprt to it
> > >> directly since that might race with a different thread waking up and
> > >> finding it.  All we can do is wake the idle thread back up and let it
> > >> attempt to find the now-queued xprt.
> > >
> > > I find it hard to think about how we expect this to affect performance.
> > > So it comes down to the observed results, I guess, but just trying to
> > > get an idea:
> > >
> > >         - this eliminates sp_lock.  I think the original idea here was
> > >           that if interrupts could be routed correctly then there
> > >           shouldn't normally be cross-cpu contention on this lock.  Do
> > >           we understand why that didn't pan out?  Is hardware capable of
> > >           doing this really rare, or is it just too hard to configure it
> > >           correctly?
> > 
> > One problem is that a 1MB incoming write will generate a lot of
> > interrupts. While that is not so noticeable on a 1GigE network, it is
> > on a 40GigE network. The other thing you should note is that this
> > workload was generated with ~100 clients pounding on that server, so
> > there are a fair amount of TCP connections to service in parallel.
> > Playing with the interrupt routing doesn't necessarily help you so
> > much when all those connections are hot.
> > 

In principle though, the percpu pool_mode should have alleviated the
contention on the sp_lock. When an interrupt comes in, the xprt gets
queued to its pool. If there is a pool for each cpu then there should
be no sp_lock contention. The pernode pool mode might also have
alleviated the lock contention to a lesser degree in a NUMA
configuration.

Do we understand why that didn't help?

In any case, I think that doing this with RCU is still preferable.
We're walking a very short list, so doing it lockless is still a
good idea to improve performance without needing to use the percpu
pool_mode.

> > >         - instead we're walking the list of all threads looking for an
> > >           idle one.  I suppose that's tpyically not more than a few
> > >           hundred.  Does this being fast depend on the fact that that
> > >           list is almost never changed?  Should we be rearranging
> > >           svc_rqst so frequently-written fields aren't nearby?
> > 
> > Given a 64-byte cache line, that is 8 pointers worth on a 64-bit processor.
> > 
> > - rq_all, rq_server, rq_pool, rq_task don't ever change, so perhaps
> > shove them together into the same cacheline?
> > 
> > - rq_xprt does get set often until we have a full RPC request worth of
> > data, so perhaps consider moving that.
> > 
> > - OTOH, rq_addr, rq_addrlen, rq_daddr, rq_daddrlen are only set once
> > we have a full RPC to process, and then keep their values until that
> > RPC call is finished. That doesn't look too bad.
> > 

That sounds reasonable to me.
J. Bruce Fields Dec. 2, 2014, 4:50 p.m. UTC | #5
On Tue, Dec 02, 2014 at 07:14:22AM -0500, Jeff Layton wrote:
> On Tue, 2 Dec 2014 06:57:50 -0500
> Jeff Layton <jeff.layton@primarydata.com> wrote:
> 
> > On Mon, 1 Dec 2014 19:38:19 -0500
> > Trond Myklebust <trondmy@gmail.com> wrote:
> > 
> > > On Mon, Dec 1, 2014 at 6:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > > > I find it hard to think about how we expect this to affect performance.
> > > > So it comes down to the observed results, I guess, but just trying to
> > > > get an idea:
> > > >
> > > >         - this eliminates sp_lock.  I think the original idea here was
> > > >           that if interrupts could be routed correctly then there
> > > >           shouldn't normally be cross-cpu contention on this lock.  Do
> > > >           we understand why that didn't pan out?  Is hardware capable of
> > > >           doing this really rare, or is it just too hard to configure it
> > > >           correctly?
> > > 
> > > One problem is that a 1MB incoming write will generate a lot of
> > > interrupts. While that is not so noticeable on a 1GigE network, it is
> > > on a 40GigE network. The other thing you should note is that this
> > > workload was generated with ~100 clients pounding on that server, so
> > > there are a fair amount of TCP connections to service in parallel.
> > > Playing with the interrupt routing doesn't necessarily help you so
> > > much when all those connections are hot.
> > > 
> 
> In principle though, the percpu pool_mode should have alleviated the
> contention on the sp_lock. When an interrupt comes in, the xprt gets
> queued to its pool. If there is a pool for each cpu then there should
> be no sp_lock contention. The pernode pool mode might also have
> alleviated the lock contention to a lesser degree in a NUMA
> configuration.
> 
> Do we understand why that didn't help?

Yes, the lots-of-interrupts-per-rpc problem strikes me as a separate if
not entirely orthogonal problem.

(And I thought it should be addressable separately; Trond and I talked
about this in Westford.  I think it currently wakes a thread to handle
each individual tcp segment--but shouldn't it be able to do all the data
copying in the interrupt and wait to wake up a thread until it's got the
entire rpc?)

> In any case, I think that doing this with RCU is still preferable.
> We're walking a very short list, so doing it lockless is still a
> good idea to improve performance without needing to use the percpu
> pool_mode.

I find that entirely plausible.

Maybe it would help to ask SGI people.  Cc'ing Ben Myers in hopes he
could point us to the right person.  It'd be interesting to know:

	- are they using the svc_pool stuff?
	- if not, why not?
	- if so:
		- can they explain how they configure systems to take
		  advantage of it?
		- do they have any recent results showing how it helps?
		- could they test Jeff's patches for performance
		  regressions?

Anyway, I'm off for now, back to work Thursday.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Myers Dec. 2, 2014, 6:53 p.m. UTC | #6
Hey Bruce,

On Tue, Dec 02, 2014 at 11:50:24AM -0500, J. Bruce Fields wrote:
> On Tue, Dec 02, 2014 at 07:14:22AM -0500, Jeff Layton wrote:
> > On Tue, 2 Dec 2014 06:57:50 -0500
> > Jeff Layton <jeff.layton@primarydata.com> wrote:
> > 
> > > On Mon, 1 Dec 2014 19:38:19 -0500
> > > Trond Myklebust <trondmy@gmail.com> wrote:
> > > 
> > > > On Mon, Dec 1, 2014 at 6:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > > > > I find it hard to think about how we expect this to affect performance.
> > > > > So it comes down to the observed results, I guess, but just trying to
> > > > > get an idea:
> > > > >
> > > > >         - this eliminates sp_lock.  I think the original idea here was
> > > > >           that if interrupts could be routed correctly then there
> > > > >           shouldn't normally be cross-cpu contention on this lock.  Do
> > > > >           we understand why that didn't pan out?  Is hardware capable of
> > > > >           doing this really rare, or is it just too hard to configure it
> > > > >           correctly?
> > > > 
> > > > One problem is that a 1MB incoming write will generate a lot of
> > > > interrupts. While that is not so noticeable on a 1GigE network, it is
> > > > on a 40GigE network. The other thing you should note is that this
> > > > workload was generated with ~100 clients pounding on that server, so
> > > > there are a fair amount of TCP connections to service in parallel.
> > > > Playing with the interrupt routing doesn't necessarily help you so
> > > > much when all those connections are hot.
> > > > 
> > 
> > In principle though, the percpu pool_mode should have alleviated the
> > contention on the sp_lock. When an interrupt comes in, the xprt gets
> > queued to its pool. If there is a pool for each cpu then there should
> > be no sp_lock contention. The pernode pool mode might also have
> > alleviated the lock contention to a lesser degree in a NUMA
> > configuration.
> > 
> > Do we understand why that didn't help?
> 
> Yes, the lots-of-interrupts-per-rpc problem strikes me as a separate if
> not entirely orthogonal problem.
> 
> (And I thought it should be addressable separately; Trond and I talked
> about this in Westford.  I think it currently wakes a thread to handle
> each individual tcp segment--but shouldn't it be able to do all the data
> copying in the interrupt and wait to wake up a thread until it's got the
> entire rpc?)
> 
> > In any case, I think that doing this with RCU is still preferable.
> > We're walking a very short list, so doing it lockless is still a
> > good idea to improve performance without needing to use the percpu
> > pool_mode.
> 
> I find that entirely plausible.
> 
> Maybe it would help to ask SGI people.  Cc'ing Ben Myers in hopes he
> could point us to the right person.
>
> It'd be interesting to know:
> 
> 	- are they using the svc_pool stuff?
> 	- if not, why not?
> 	- if so:
> 		- can they explain how they configure systems to take
> 		  advantage of it?
> 		- do they have any recent results showing how it helps?
> 		- could they test Jeff's patches for performance
> 		  regressions?
> 
> Anyway, I'm off for now, back to work Thursday.
> 
> --b.

Andrew Dahl is the right person.  Cc'd. 

Regards,
	Ben
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields Dec. 8, 2014, 6:57 p.m. UTC | #7
On Tue, Dec 02, 2014 at 11:50:24AM -0500, J. Bruce Fields wrote:
> On Tue, Dec 02, 2014 at 07:14:22AM -0500, Jeff Layton wrote:
> > On Tue, 2 Dec 2014 06:57:50 -0500
> > Jeff Layton <jeff.layton@primarydata.com> wrote:
> > 
> > > On Mon, 1 Dec 2014 19:38:19 -0500
> > > Trond Myklebust <trondmy@gmail.com> wrote:
> > > 
> > > > On Mon, Dec 1, 2014 at 6:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > > > > I find it hard to think about how we expect this to affect performance.
> > > > > So it comes down to the observed results, I guess, but just trying to
> > > > > get an idea:
> > > > >
> > > > >         - this eliminates sp_lock.  I think the original idea here was
> > > > >           that if interrupts could be routed correctly then there
> > > > >           shouldn't normally be cross-cpu contention on this lock.  Do
> > > > >           we understand why that didn't pan out?  Is hardware capable of
> > > > >           doing this really rare, or is it just too hard to configure it
> > > > >           correctly?
> > > > 
> > > > One problem is that a 1MB incoming write will generate a lot of
> > > > interrupts. While that is not so noticeable on a 1GigE network, it is
> > > > on a 40GigE network. The other thing you should note is that this
> > > > workload was generated with ~100 clients pounding on that server, so
> > > > there are a fair amount of TCP connections to service in parallel.
> > > > Playing with the interrupt routing doesn't necessarily help you so
> > > > much when all those connections are hot.
> > > > 
> > 
> > In principle though, the percpu pool_mode should have alleviated the
> > contention on the sp_lock. When an interrupt comes in, the xprt gets
> > queued to its pool. If there is a pool for each cpu then there should
> > be no sp_lock contention. The pernode pool mode might also have
> > alleviated the lock contention to a lesser degree in a NUMA
> > configuration.
> > 
> > Do we understand why that didn't help?
> 
> Yes, the lots-of-interrupts-per-rpc problem strikes me as a separate if
> not entirely orthogonal problem.
> 
> (And I thought it should be addressable separately; Trond and I talked
> about this in Westford.  I think it currently wakes a thread to handle
> each individual tcp segment--but shouldn't it be able to do all the data
> copying in the interrupt and wait to wake up a thread until it's got the
> entire rpc?)

By the way, Jeff, isn't this part of what's complicating the workqueue
change?  That would seem simpler if we didn't need to queue work until
we had the full rpc.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Layton Dec. 8, 2014, 7:54 p.m. UTC | #8
On Mon, 8 Dec 2014 13:57:31 -0500
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Tue, Dec 02, 2014 at 11:50:24AM -0500, J. Bruce Fields wrote:
> > On Tue, Dec 02, 2014 at 07:14:22AM -0500, Jeff Layton wrote:
> > > On Tue, 2 Dec 2014 06:57:50 -0500
> > > Jeff Layton <jeff.layton@primarydata.com> wrote:
> > > 
> > > > On Mon, 1 Dec 2014 19:38:19 -0500
> > > > Trond Myklebust <trondmy@gmail.com> wrote:
> > > > 
> > > > > On Mon, Dec 1, 2014 at 6:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > > > > > I find it hard to think about how we expect this to affect performance.
> > > > > > So it comes down to the observed results, I guess, but just trying to
> > > > > > get an idea:
> > > > > >
> > > > > >         - this eliminates sp_lock.  I think the original idea here was
> > > > > >           that if interrupts could be routed correctly then there
> > > > > >           shouldn't normally be cross-cpu contention on this lock.  Do
> > > > > >           we understand why that didn't pan out?  Is hardware capable of
> > > > > >           doing this really rare, or is it just too hard to configure it
> > > > > >           correctly?
> > > > > 
> > > > > One problem is that a 1MB incoming write will generate a lot of
> > > > > interrupts. While that is not so noticeable on a 1GigE network, it is
> > > > > on a 40GigE network. The other thing you should note is that this
> > > > > workload was generated with ~100 clients pounding on that server, so
> > > > > there are a fair amount of TCP connections to service in parallel.
> > > > > Playing with the interrupt routing doesn't necessarily help you so
> > > > > much when all those connections are hot.
> > > > > 
> > > 
> > > In principle though, the percpu pool_mode should have alleviated the
> > > contention on the sp_lock. When an interrupt comes in, the xprt gets
> > > queued to its pool. If there is a pool for each cpu then there should
> > > be no sp_lock contention. The pernode pool mode might also have
> > > alleviated the lock contention to a lesser degree in a NUMA
> > > configuration.
> > > 
> > > Do we understand why that didn't help?
> > 
> > Yes, the lots-of-interrupts-per-rpc problem strikes me as a separate if
> > not entirely orthogonal problem.
> > 
> > (And I thought it should be addressable separately; Trond and I talked
> > about this in Westford.  I think it currently wakes a thread to handle
> > each individual tcp segment--but shouldn't it be able to do all the data
> > copying in the interrupt and wait to wake up a thread until it's got the
> > entire rpc?)
> 
> By the way, Jeff, isn't this part of what's complicating the workqueue
> change?  That would seem simpler if we didn't need to queue work until
> we had the full rpc.
> 

No, I don't think that really adds much in the way of complexity there.

I have that set working. Most of what's holding me up from posting the
next iteration of that set is performance. So far, my testing shows
that the workqueue-based code is slightly slower. I've been trying to
figure out why that is and whether I can do anything about it. Maybe
I'll go ahead and post it as a second RFC set, until I can get to the
bottom of the perf delta.

I have pondered doing what you're suggesting above though and it's not a
trivial change.

The problem is that all of the buffers into which we do receives are
associated with the svc_rqst (which we don't really have when the
interrupt comes in), and not the svc_xprt (which we do have at that
point).

So, you'd need to restructure the code to hang a receive buffer off
of the svc_xprt. Once you receive an entire RPC, you'd then have to
flip that buffer over to a svc_rqst, queue up the job and grab a new
buffer for the xprt (maybe you could swap them?).

The problem is what to do if you don't have a buffer (or svc_rqst)
available when an IRQ comes in. You can't allocate one from softirq
context, so you'd need to offload that case to a workqueue or something
anyway (which adds a bit of complexity as you'd then have to deal with
two different receive paths).

I'm also not sure about RDMA. When you get an RPC, the server usually
has to do an RDMA READ from the client to pull all of the data in. I
don't think you want to do that from softirq context, so that would
also need to be queued up somehow.

All of that said, it would probably reduce some context switching if
we can make that work. Also, I suspect that doing that in the context
of the workqueue-based code would probably be at least a little simpler.
J. Bruce Fields Dec. 8, 2014, 7:58 p.m. UTC | #9
On Mon, Dec 08, 2014 at 02:54:29PM -0500, Jeff Layton wrote:
> On Mon, 8 Dec 2014 13:57:31 -0500
> "J. Bruce Fields" <bfields@fieldses.org> wrote:
> 
> > On Tue, Dec 02, 2014 at 11:50:24AM -0500, J. Bruce Fields wrote:
> > > On Tue, Dec 02, 2014 at 07:14:22AM -0500, Jeff Layton wrote:
> > > > On Tue, 2 Dec 2014 06:57:50 -0500
> > > > Jeff Layton <jeff.layton@primarydata.com> wrote:
> > > > 
> > > > > On Mon, 1 Dec 2014 19:38:19 -0500
> > > > > Trond Myklebust <trondmy@gmail.com> wrote:
> > > > > 
> > > > > > On Mon, Dec 1, 2014 at 6:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > > > > > > I find it hard to think about how we expect this to affect performance.
> > > > > > > So it comes down to the observed results, I guess, but just trying to
> > > > > > > get an idea:
> > > > > > >
> > > > > > >         - this eliminates sp_lock.  I think the original idea here was
> > > > > > >           that if interrupts could be routed correctly then there
> > > > > > >           shouldn't normally be cross-cpu contention on this lock.  Do
> > > > > > >           we understand why that didn't pan out?  Is hardware capable of
> > > > > > >           doing this really rare, or is it just too hard to configure it
> > > > > > >           correctly?
> > > > > > 
> > > > > > One problem is that a 1MB incoming write will generate a lot of
> > > > > > interrupts. While that is not so noticeable on a 1GigE network, it is
> > > > > > on a 40GigE network. The other thing you should note is that this
> > > > > > workload was generated with ~100 clients pounding on that server, so
> > > > > > there are a fair amount of TCP connections to service in parallel.
> > > > > > Playing with the interrupt routing doesn't necessarily help you so
> > > > > > much when all those connections are hot.
> > > > > > 
> > > > 
> > > > In principle though, the percpu pool_mode should have alleviated the
> > > > contention on the sp_lock. When an interrupt comes in, the xprt gets
> > > > queued to its pool. If there is a pool for each cpu then there should
> > > > be no sp_lock contention. The pernode pool mode might also have
> > > > alleviated the lock contention to a lesser degree in a NUMA
> > > > configuration.
> > > > 
> > > > Do we understand why that didn't help?
> > > 
> > > Yes, the lots-of-interrupts-per-rpc problem strikes me as a separate if
> > > not entirely orthogonal problem.
> > > 
> > > (And I thought it should be addressable separately; Trond and I talked
> > > about this in Westford.  I think it currently wakes a thread to handle
> > > each individual tcp segment--but shouldn't it be able to do all the data
> > > copying in the interrupt and wait to wake up a thread until it's got the
> > > entire rpc?)
> > 
> > By the way, Jeff, isn't this part of what's complicating the workqueue
> > change?  That would seem simpler if we didn't need to queue work until
> > we had the full rpc.
> > 
> 
> No, I don't think that really adds much in the way of complexity there.
> 
> I have that set working. Most of what's holding me up from posting the
> next iteration of that set is performance. So far, my testing shows
> that the workqueue-based code is slightly slower. I've been trying to
> figure out why that is and whether I can do anything about it. Maybe
> I'll go ahead and post it as a second RFC set, until I can get to the
> bottom of the perf delta.
> 
> I have pondered doing what you're suggesting above though and it's not a
> trivial change.
> 
> The problem is that all of the buffers into which we do receives are
> associated with the svc_rqst (which we don't really have when the
> interrupt comes in), and not the svc_xprt (which we do have at that
> point).
> 
> So, you'd need to restructure the code to hang a receive buffer off
> of the svc_xprt.

Have you looked at svsk->sk_pages and svc_tcp_{save,restore}_pages?

--b.

> Once you receive an entire RPC, you'd then have to
> flip that buffer over to a svc_rqst, queue up the job and grab a new
> buffer for the xprt (maybe you could swap them?).
> 
> The problem is what to do if you don't have a buffer (or svc_rqst)
> available when an IRQ comes in. You can't allocate one from softirq
> context, so you'd need to offload that case to a workqueue or something
> anyway (which adds a bit of complexity as you'd then have to deal with
> two different receive paths).
> 
> I'm also not sure about RDMA. When you get an RPC, the server usually
> has to do an RDMA READ from the client to pull all of the data in. I
> don't think you want to do that from softirq context, so that would
> also need to be queued up somehow.
> 
> All of that said, it would probably reduce some context switching if
> we can make that work. Also, I suspect that doing that in the context
> of the workqueue-based code would probably be at least a little simpler.
> 
> -- 
> Jeff Layton <jlayton@primarydata.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Layton Dec. 8, 2014, 8:24 p.m. UTC | #10
On Mon, 8 Dec 2014 14:58:55 -0500
"J. Bruce Fields" <bfields@fieldses.org> wrote:

> On Mon, Dec 08, 2014 at 02:54:29PM -0500, Jeff Layton wrote:
> > On Mon, 8 Dec 2014 13:57:31 -0500
> > "J. Bruce Fields" <bfields@fieldses.org> wrote:
> > 
> > > On Tue, Dec 02, 2014 at 11:50:24AM -0500, J. Bruce Fields wrote:
> > > > On Tue, Dec 02, 2014 at 07:14:22AM -0500, Jeff Layton wrote:
> > > > > On Tue, 2 Dec 2014 06:57:50 -0500
> > > > > Jeff Layton <jeff.layton@primarydata.com> wrote:
> > > > > 
> > > > > > On Mon, 1 Dec 2014 19:38:19 -0500
> > > > > > Trond Myklebust <trondmy@gmail.com> wrote:
> > > > > > 
> > > > > > > On Mon, Dec 1, 2014 at 6:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > > > > > > > I find it hard to think about how we expect this to affect performance.
> > > > > > > > So it comes down to the observed results, I guess, but just trying to
> > > > > > > > get an idea:
> > > > > > > >
> > > > > > > >         - this eliminates sp_lock.  I think the original idea here was
> > > > > > > >           that if interrupts could be routed correctly then there
> > > > > > > >           shouldn't normally be cross-cpu contention on this lock.  Do
> > > > > > > >           we understand why that didn't pan out?  Is hardware capable of
> > > > > > > >           doing this really rare, or is it just too hard to configure it
> > > > > > > >           correctly?
> > > > > > > 
> > > > > > > One problem is that a 1MB incoming write will generate a lot of
> > > > > > > interrupts. While that is not so noticeable on a 1GigE network, it is
> > > > > > > on a 40GigE network. The other thing you should note is that this
> > > > > > > workload was generated with ~100 clients pounding on that server, so
> > > > > > > there are a fair amount of TCP connections to service in parallel.
> > > > > > > Playing with the interrupt routing doesn't necessarily help you so
> > > > > > > much when all those connections are hot.
> > > > > > > 
> > > > > 
> > > > > In principle though, the percpu pool_mode should have alleviated the
> > > > > contention on the sp_lock. When an interrupt comes in, the xprt gets
> > > > > queued to its pool. If there is a pool for each cpu then there should
> > > > > be no sp_lock contention. The pernode pool mode might also have
> > > > > alleviated the lock contention to a lesser degree in a NUMA
> > > > > configuration.
> > > > > 
> > > > > Do we understand why that didn't help?
> > > > 
> > > > Yes, the lots-of-interrupts-per-rpc problem strikes me as a separate if
> > > > not entirely orthogonal problem.
> > > > 
> > > > (And I thought it should be addressable separately; Trond and I talked
> > > > about this in Westford.  I think it currently wakes a thread to handle
> > > > each individual tcp segment--but shouldn't it be able to do all the data
> > > > copying in the interrupt and wait to wake up a thread until it's got the
> > > > entire rpc?)
> > > 
> > > By the way, Jeff, isn't this part of what's complicating the workqueue
> > > change?  That would seem simpler if we didn't need to queue work until
> > > we had the full rpc.
> > > 
> > 
> > No, I don't think that really adds much in the way of complexity there.
> > 
> > I have that set working. Most of what's holding me up from posting the
> > next iteration of that set is performance. So far, my testing shows
> > that the workqueue-based code is slightly slower. I've been trying to
> > figure out why that is and whether I can do anything about it. Maybe
> > I'll go ahead and post it as a second RFC set, until I can get to the
> > bottom of the perf delta.
> > 
> > I have pondered doing what you're suggesting above though and it's not a
> > trivial change.
> > 
> > The problem is that all of the buffers into which we do receives are
> > associated with the svc_rqst (which we don't really have when the
> > interrupt comes in), and not the svc_xprt (which we do have at that
> > point).
> > 
> > So, you'd need to restructure the code to hang a receive buffer off
> > of the svc_xprt.
> 
> Have you looked at svsk->sk_pages and svc_tcp_{save,restore}_pages?
> 
> --b.
> 

Ahh, no I hadn't...interesting.

So, basically do the receive into the rqstp's buffer, and if you
don't get everything you need you stuff the pages into the sk_pages
array to await the next pass. Weird design...

Ok, so you could potentially flip that around. Do the receive into the
sk_pages buffer in softirq context, and then hand those off to the rqst
(in some fashion) once you've received a full RPC.

You'd have to work out how to replenish the sk_pages after each
receive, and what to do about RDMA, but it's probably doable.

> > Once you receive an entire RPC, you'd then have to
> > flip that buffer over to a svc_rqst, queue up the job and grab a new
> > buffer for the xprt (maybe you could swap them?).
> > 
> > The problem is what to do if you don't have a buffer (or svc_rqst)
> > available when an IRQ comes in. You can't allocate one from softirq
> > context, so you'd need to offload that case to a workqueue or something
> > anyway (which adds a bit of complexity as you'd then have to deal with
> > two different receive paths).
> > 
> > I'm also not sure about RDMA. When you get an RPC, the server usually
> > has to do an RDMA READ from the client to pull all of the data in. I
> > don't think you want to do that from softirq context, so that would
> > also need to be queued up somehow.
> > 
> > All of that said, it would probably reduce some context switching if
> > we can make that work. Also, I suspect that doing that in the context
> > of the workqueue-based code would probably be at least a little simpler.
> > 
> > -- 
> > Jeff Layton <jlayton@primarydata.com>
J. Bruce Fields Dec. 9, 2014, 4:57 p.m. UTC | #11
On Tue, Dec 02, 2014 at 07:14:22AM -0500, Jeff Layton wrote:
> On Tue, 2 Dec 2014 06:57:50 -0500
> Jeff Layton <jeff.layton@primarydata.com> wrote:
> 
> > On Mon, 1 Dec 2014 19:38:19 -0500
> > Trond Myklebust <trondmy@gmail.com> wrote:
> > 
> > > On Mon, Dec 1, 2014 at 6:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > > >         - instead we're walking the list of all threads looking for an
> > > >           idle one.  I suppose that's tpyically not more than a few
> > > >           hundred.  Does this being fast depend on the fact that that
> > > >           list is almost never changed?  Should we be rearranging
> > > >           svc_rqst so frequently-written fields aren't nearby?
> > > 
> > > Given a 64-byte cache line, that is 8 pointers worth on a 64-bit processor.
> > > 
> > > - rq_all, rq_server, rq_pool, rq_task don't ever change, so perhaps
> > > shove them together into the same cacheline?
> > > 
> > > - rq_xprt does get set often until we have a full RPC request worth of
> > > data, so perhaps consider moving that.
> > > 
> > > - OTOH, rq_addr, rq_addrlen, rq_daddr, rq_daddrlen are only set once
> > > we have a full RPC to process, and then keep their values until that
> > > RPC call is finished. That doesn't look too bad.

By the way, one thing I forgot when writing the above comment was that
the list we're walking (sp_all_threads) is *still* per-pool (for some
reason I was thinking it was global), so it's really unlikely we're
making things worse here.

Still, reshuffling those svc_rqst fields is easy and might help.

I think your tests probably aren't hitting the worst case here, either:
even in a read-mostly case most interrupts will be handling the (less
frequent but larger) writes.  Maybe an all-stat workload would test the
case where e.g. rq_xprt is written to every time?  But I haven't thought
that through.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields Dec. 9, 2014, 5:04 p.m. UTC | #12
On Tue, Dec 02, 2014 at 12:53:58PM -0600, Ben Myers wrote:
> Hey Bruce,
> 
> On Tue, Dec 02, 2014 at 11:50:24AM -0500, J. Bruce Fields wrote:
> > On Tue, Dec 02, 2014 at 07:14:22AM -0500, Jeff Layton wrote:
> > > On Tue, 2 Dec 2014 06:57:50 -0500
> > > Jeff Layton <jeff.layton@primarydata.com> wrote:
> > > 
> > > > On Mon, 1 Dec 2014 19:38:19 -0500
> > > > Trond Myklebust <trondmy@gmail.com> wrote:
> > > > 
> > > > > On Mon, Dec 1, 2014 at 6:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > > > > > I find it hard to think about how we expect this to affect performance.
> > > > > > So it comes down to the observed results, I guess, but just trying to
> > > > > > get an idea:
> > > > > >
> > > > > >         - this eliminates sp_lock.  I think the original idea here was
> > > > > >           that if interrupts could be routed correctly then there
> > > > > >           shouldn't normally be cross-cpu contention on this lock.  Do
> > > > > >           we understand why that didn't pan out?  Is hardware capable of
> > > > > >           doing this really rare, or is it just too hard to configure it
> > > > > >           correctly?
> > > > > 
> > > > > One problem is that a 1MB incoming write will generate a lot of
> > > > > interrupts. While that is not so noticeable on a 1GigE network, it is
> > > > > on a 40GigE network. The other thing you should note is that this
> > > > > workload was generated with ~100 clients pounding on that server, so
> > > > > there are a fair amount of TCP connections to service in parallel.
> > > > > Playing with the interrupt routing doesn't necessarily help you so
> > > > > much when all those connections are hot.
> > > > > 
> > > 
> > > In principle though, the percpu pool_mode should have alleviated the
> > > contention on the sp_lock. When an interrupt comes in, the xprt gets
> > > queued to its pool. If there is a pool for each cpu then there should
> > > be no sp_lock contention. The pernode pool mode might also have
> > > alleviated the lock contention to a lesser degree in a NUMA
> > > configuration.
> > > 
> > > Do we understand why that didn't help?
> > 
> > Yes, the lots-of-interrupts-per-rpc problem strikes me as a separate if
> > not entirely orthogonal problem.
> > 
> > (And I thought it should be addressable separately; Trond and I talked
> > about this in Westford.  I think it currently wakes a thread to handle
> > each individual tcp segment--but shouldn't it be able to do all the data
> > copying in the interrupt and wait to wake up a thread until it's got the
> > entire rpc?)
> > 
> > > In any case, I think that doing this with RCU is still preferable.
> > > We're walking a very short list, so doing it lockless is still a
> > > good idea to improve performance without needing to use the percpu
> > > pool_mode.
> > 
> > I find that entirely plausible.
> > 
> > Maybe it would help to ask SGI people.  Cc'ing Ben Myers in hopes he
> > could point us to the right person.
> >
> > It'd be interesting to know:
> > 
> > 	- are they using the svc_pool stuff?
> > 	- if not, why not?
> > 	- if so:
> > 		- can they explain how they configure systems to take
> > 		  advantage of it?
> > 		- do they have any recent results showing how it helps?
> > 		- could they test Jeff's patches for performance
> > 		  regressions?
> > 
> > Anyway, I'm off for now, back to work Thursday.
> > 
> > --b.
> 
> Andrew Dahl is the right person.  Cc'd. 

Thanks!

I'm less worried about Jeff's particular changes here, but I would still
really love to see answers to the above questions.

We've had a couple cases now of people trying to use the pool_modes for
performance tuning without good results, and I'd like to figure out
what's happening.  If this keeps up then we may end up just breaking
them by accident (if we haven't already).

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 513957eba0a5..6f22cfeef5e3 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -45,7 +45,6 @@  struct svc_pool_stats {
 struct svc_pool {
 	unsigned int		sp_id;	    	/* pool id; also node id on NUMA */
 	spinlock_t		sp_lock;	/* protects all fields */
-	struct list_head	sp_threads;	/* idle server threads */
 	struct list_head	sp_sockets;	/* pending sockets */
 	unsigned int		sp_nrthreads;	/* # of threads in pool */
 	struct list_head	sp_all_threads;	/* all server threads */
@@ -221,7 +220,6 @@  static inline void svc_putu32(struct kvec *iov, __be32 val)
  * processed.
  */
 struct svc_rqst {
-	struct list_head	rq_list;	/* idle list */
 	struct list_head	rq_all;		/* all threads list */
 	struct rcu_head		rq_rcu_head;	/* for RCU deferred kfree */
 	struct svc_xprt *	rq_xprt;	/* transport ptr */
@@ -264,6 +262,7 @@  struct svc_rqst {
 						 * to prevent encrypting page
 						 * cache pages */
 #define	RQ_VICTIM	(5)			/* about to be shut down */
+#define	RQ_BUSY		(6)			/* request is busy */
 	unsigned long		rq_flags;	/* flags field */
 
 	void *			rq_argp;	/* decoded arguments */
@@ -285,6 +284,7 @@  struct svc_rqst {
 	struct auth_domain *	rq_gssclient;	/* "gss/"-style peer info */
 	struct svc_cacherep *	rq_cacherep;	/* cache info */
 	struct task_struct	*rq_task;	/* service thread */
+	spinlock_t		rq_lock;	/* per-request lock */
 };
 
 #define SVC_NET(svc_rqst)	(svc_rqst->rq_xprt->xpt_net)
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 08a5fed50f34..ee4438a63a48 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -419,7 +419,8 @@  TRACE_EVENT(xs_tcp_data_recv,
 		{ (1UL << RQ_USEDEFERRAL),	"RQ_USEDEFERRAL"},	\
 		{ (1UL << RQ_DROPME),		"RQ_DROPME"},		\
 		{ (1UL << RQ_SPLICE_OK),	"RQ_SPLICE_OK"},	\
-		{ (1UL << RQ_VICTIM),		"RQ_VICTIM"})
+		{ (1UL << RQ_VICTIM),		"RQ_VICTIM"},		\
+		{ (1UL << RQ_BUSY),		"RQ_BUSY"})
 
 TRACE_EVENT(svc_recv,
 	TP_PROTO(struct svc_rqst *rqst, int status),
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 4edef32f3b9f..4308881d9d0a 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -476,7 +476,6 @@  __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
 				i, serv->sv_name);
 
 		pool->sp_id = i;
-		INIT_LIST_HEAD(&pool->sp_threads);
 		INIT_LIST_HEAD(&pool->sp_sockets);
 		INIT_LIST_HEAD(&pool->sp_all_threads);
 		spin_lock_init(&pool->sp_lock);
@@ -614,12 +613,14 @@  svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
 		goto out_enomem;
 
 	serv->sv_nrthreads++;
+	__set_bit(RQ_BUSY, &rqstp->rq_flags);
+	spin_lock_init(&rqstp->rq_lock);
+	rqstp->rq_server = serv;
+	rqstp->rq_pool = pool;
 	spin_lock_bh(&pool->sp_lock);
 	pool->sp_nrthreads++;
 	list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
 	spin_unlock_bh(&pool->sp_lock);
-	rqstp->rq_server = serv;
-	rqstp->rq_pool = pool;
 
 	rqstp->rq_argp = kmalloc_node(serv->sv_xdrsize, GFP_KERNEL, node);
 	if (!rqstp->rq_argp)
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 579ff2249562..ed90d955f733 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -310,25 +310,6 @@  char *svc_print_addr(struct svc_rqst *rqstp, char *buf, size_t len)
 }
 EXPORT_SYMBOL_GPL(svc_print_addr);
 
-/*
- * Queue up an idle server thread.  Must have pool->sp_lock held.
- * Note: this is really a stack rather than a queue, so that we only
- * use as many different threads as we need, and the rest don't pollute
- * the cache.
- */
-static void svc_thread_enqueue(struct svc_pool *pool, struct svc_rqst *rqstp)
-{
-	list_add(&rqstp->rq_list, &pool->sp_threads);
-}
-
-/*
- * Dequeue an nfsd thread.  Must have pool->sp_lock held.
- */
-static void svc_thread_dequeue(struct svc_pool *pool, struct svc_rqst *rqstp)
-{
-	list_del(&rqstp->rq_list);
-}
-
 static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
 {
 	if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
@@ -343,6 +324,7 @@  static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
 	struct svc_pool *pool;
 	struct svc_rqst	*rqstp;
 	int cpu;
+	bool queued = false;
 
 	if (!svc_xprt_has_something_to_do(xprt))
 		return;
@@ -360,37 +342,60 @@  static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
 
 	cpu = get_cpu();
 	pool = svc_pool_for_cpu(xprt->xpt_server, cpu);
-	spin_lock_bh(&pool->sp_lock);
 
 	atomic_long_inc(&pool->sp_stats.packets);
 
-	if (!list_empty(&pool->sp_threads)) {
-		rqstp = list_entry(pool->sp_threads.next,
-				   struct svc_rqst,
-				   rq_list);
-		dprintk("svc: transport %p served by daemon %p\n",
-			xprt, rqstp);
-		svc_thread_dequeue(pool, rqstp);
-		if (rqstp->rq_xprt)
-			printk(KERN_ERR
-				"svc_xprt_enqueue: server %p, rq_xprt=%p!\n",
-				rqstp, rqstp->rq_xprt);
-		/* Note the order of the following 3 lines:
-		 * We want to assign xprt to rqstp->rq_xprt only _after_
-		 * we've woken up the process, so that we don't race with
-		 * the lockless check in svc_get_next_xprt().
+redo_search:
+	/* find a thread for this xprt */
+	rcu_read_lock();
+	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
+		/* Do a lockless check first */
+		if (test_bit(RQ_BUSY, &rqstp->rq_flags))
+			continue;
+
+		/*
+		 * Once the xprt has been queued, it can only be dequeued by
+		 * the task that intends to service it. All we can do at that
+		 * point is to try to wake this thread back up so that it can
+		 * do so.
 		 */
-		svc_xprt_get(xprt);
-		wake_up_process(rqstp->rq_task);
-		rqstp->rq_xprt = xprt;
+		if (!queued) {
+			spin_lock_bh(&rqstp->rq_lock);
+			if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags)) {
+				/* already busy, move on... */
+				spin_unlock_bh(&rqstp->rq_lock);
+				continue;
+			}
+
+			/* this one will do */
+			rqstp->rq_xprt = xprt;
+			svc_xprt_get(xprt);
+			spin_unlock_bh(&rqstp->rq_lock);
+		}
+		rcu_read_unlock();
+
 		atomic_long_inc(&pool->sp_stats.threads_woken);
-	} else {
+		wake_up_process(rqstp->rq_task);
+		put_cpu();
+		return;
+	}
+	rcu_read_unlock();
+
+	/*
+	 * We didn't find an idle thread to use, so we need to queue the xprt.
+	 * Do so and then search again. If we find one, we can't hook this one
+	 * up to it directly but we can wake the thread up in the hopes that it
+	 * will pick it up once it searches for a xprt to service.
+	 */
+	if (!queued) {
+		queued = true;
 		dprintk("svc: transport %p put into queue\n", xprt);
+		spin_lock_bh(&pool->sp_lock);
 		list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
 		pool->sp_stats.sockets_queued++;
+		spin_unlock_bh(&pool->sp_lock);
+		goto redo_search;
 	}
-
-	spin_unlock_bh(&pool->sp_lock);
 	put_cpu();
 }
 
@@ -408,21 +413,26 @@  void svc_xprt_enqueue(struct svc_xprt *xprt)
 EXPORT_SYMBOL_GPL(svc_xprt_enqueue);
 
 /*
- * Dequeue the first transport.  Must be called with the pool->sp_lock held.
+ * Dequeue the first transport, if there is one.
  */
 static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
 {
-	struct svc_xprt	*xprt;
+	struct svc_xprt	*xprt = NULL;
 
 	if (list_empty(&pool->sp_sockets))
 		return NULL;
 
-	xprt = list_entry(pool->sp_sockets.next,
-			  struct svc_xprt, xpt_ready);
-	list_del_init(&xprt->xpt_ready);
+	spin_lock_bh(&pool->sp_lock);
+	if (likely(!list_empty(&pool->sp_sockets))) {
+		xprt = list_first_entry(&pool->sp_sockets,
+					struct svc_xprt, xpt_ready);
+		list_del_init(&xprt->xpt_ready);
+		svc_xprt_get(xprt);
 
-	dprintk("svc: transport %p dequeued, inuse=%d\n",
-		xprt, atomic_read(&xprt->xpt_ref.refcount));
+		dprintk("svc: transport %p dequeued, inuse=%d\n",
+			xprt, atomic_read(&xprt->xpt_ref.refcount));
+	}
+	spin_unlock_bh(&pool->sp_lock);
 
 	return xprt;
 }
@@ -497,16 +507,21 @@  void svc_wake_up(struct svc_serv *serv)
 
 	pool = &serv->sv_pools[0];
 
-	spin_lock_bh(&pool->sp_lock);
-	if (!list_empty(&pool->sp_threads)) {
-		rqstp = list_entry(pool->sp_threads.next,
-				   struct svc_rqst,
-				   rq_list);
+	rcu_read_lock();
+	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
+		/* skip any that aren't queued */
+		if (test_bit(RQ_BUSY, &rqstp->rq_flags))
+			continue;
+		rcu_read_unlock();
 		dprintk("svc: daemon %p woken up.\n", rqstp);
 		wake_up_process(rqstp->rq_task);
-	} else
-		set_bit(SP_TASK_PENDING, &pool->sp_flags);
-	spin_unlock_bh(&pool->sp_lock);
+		return;
+	}
+	rcu_read_unlock();
+
+	/* No free entries available */
+	set_bit(SP_TASK_PENDING, &pool->sp_flags);
+	smp_wmb();
 }
 EXPORT_SYMBOL_GPL(svc_wake_up);
 
@@ -617,22 +632,47 @@  static int svc_alloc_arg(struct svc_rqst *rqstp)
 	return 0;
 }
 
+static bool
+rqst_should_sleep(struct svc_rqst *rqstp)
+{
+	struct svc_pool		*pool = rqstp->rq_pool;
+
+	/* did someone call svc_wake_up? */
+	if (test_and_clear_bit(SP_TASK_PENDING, &pool->sp_flags))
+		return false;
+
+	/* was a socket queued? */
+	if (!list_empty(&pool->sp_sockets))
+		return false;
+
+	/* are we shutting down? */
+	if (signalled() || kthread_should_stop())
+		return false;
+
+	/* are we freezing? */
+	if (freezing(current))
+		return false;
+
+	return true;
+}
+
 static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 {
 	struct svc_xprt *xprt;
 	struct svc_pool		*pool = rqstp->rq_pool;
 	long			time_left = 0;
 
+	/* rq_xprt should be clear on entry */
+	WARN_ON_ONCE(rqstp->rq_xprt);
+
 	/* Normally we will wait up to 5 seconds for any required
 	 * cache information to be provided.
 	 */
 	rqstp->rq_chandle.thread_wait = 5*HZ;
 
-	spin_lock_bh(&pool->sp_lock);
 	xprt = svc_xprt_dequeue(pool);
 	if (xprt) {
 		rqstp->rq_xprt = xprt;
-		svc_xprt_get(xprt);
 
 		/* As there is a shortage of threads and this request
 		 * had to be queued, don't allow the thread to wait so
@@ -640,51 +680,38 @@  static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 		 */
 		rqstp->rq_chandle.thread_wait = 1*HZ;
 		clear_bit(SP_TASK_PENDING, &pool->sp_flags);
-	} else {
-		if (test_and_clear_bit(SP_TASK_PENDING, &pool->sp_flags)) {
-			xprt = ERR_PTR(-EAGAIN);
-			goto out;
-		}
-		/*
-		 * We have to be able to interrupt this wait
-		 * to bring down the daemons ...
-		 */
-		set_current_state(TASK_INTERRUPTIBLE);
+		return xprt;
+	}
 
-		/* No data pending. Go to sleep */
-		svc_thread_enqueue(pool, rqstp);
-		spin_unlock_bh(&pool->sp_lock);
+	/*
+	 * We have to be able to interrupt this wait
+	 * to bring down the daemons ...
+	 */
+	set_current_state(TASK_INTERRUPTIBLE);
+	clear_bit(RQ_BUSY, &rqstp->rq_flags);
+	smp_mb();
+
+	if (likely(rqst_should_sleep(rqstp)))
+		time_left = schedule_timeout(timeout);
+	else
+		__set_current_state(TASK_RUNNING);
 
-		if (!(signalled() || kthread_should_stop())) {
-			time_left = schedule_timeout(timeout);
-			__set_current_state(TASK_RUNNING);
+	try_to_freeze();
 
-			try_to_freeze();
+	spin_lock_bh(&rqstp->rq_lock);
+	set_bit(RQ_BUSY, &rqstp->rq_flags);
+	spin_unlock_bh(&rqstp->rq_lock);
 
-			xprt = rqstp->rq_xprt;
-			if (xprt != NULL)
-				return xprt;
-		} else
-			__set_current_state(TASK_RUNNING);
+	xprt = rqstp->rq_xprt;
+	if (xprt != NULL)
+		return xprt;
 
-		spin_lock_bh(&pool->sp_lock);
-		if (!time_left)
-			atomic_long_inc(&pool->sp_stats.threads_timedout);
+	if (!time_left)
+		atomic_long_inc(&pool->sp_stats.threads_timedout);
 
-		xprt = rqstp->rq_xprt;
-		if (!xprt) {
-			svc_thread_dequeue(pool, rqstp);
-			spin_unlock_bh(&pool->sp_lock);
-			dprintk("svc: server %p, no data yet\n", rqstp);
-			if (signalled() || kthread_should_stop())
-				return ERR_PTR(-EINTR);
-			else
-				return ERR_PTR(-EAGAIN);
-		}
-	}
-out:
-	spin_unlock_bh(&pool->sp_lock);
-	return xprt;
+	if (signalled() || kthread_should_stop())
+		return ERR_PTR(-EINTR);
+	return ERR_PTR(-EAGAIN);
 }
 
 static void svc_add_new_temp_xprt(struct svc_serv *serv, struct svc_xprt *newxpt)