
[1/7] SUNRPC: change service idle list to be an llist

Message ID 20230818014512.26880-2-neilb@suse.de (mailing list archive)
State New, archived
Series SUNRPC - queuing improvements.

Commit Message

NeilBrown Aug. 18, 2023, 1:45 a.m. UTC
With an llist we don't need to take a lock to add a thread to the list,
though we still need a lock to remove it.  That will go in the next
patch.

Unlike double-linked lists, a thread cannot reliably remove itself from
the list.  Only the first thread can be removed, and that can change
asynchronously.  So some care is needed.

We already check if there is pending work to do, so we are unlikely to
add ourselves to the idle list and then want to remove ourselves again.

If we DO find something needs to be done after adding ourselves to the
list, we simply wake up the first thread on the list.  If that was us,
we successfully removed ourselves and can continue.  If it was some
other thread, they will do the work that needs to be done.  We can
safely sleep until woken.

We also remove the test on freezing() from rqst_should_sleep().  Instead
we set TASK_FREEZABLE before scheduling.  This makes it safe to
schedule() when a freeze is pending.  As we now loop waiting to be
removed from the idle queue, this is a cleaner way to handle freezing.

task_state_index() incorrectly identifies a task with
   TASK_IDLE | TASK_FREEZABLE
as 'D'.  So fix __task_state_index to ignore extra flags on TASK_IDLE
just as it ignores extra flags on other states.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/sched.h      |  4 +--
 include/linux/sunrpc/svc.h | 14 ++++++-----
 net/sunrpc/svc.c           | 13 +++++-----
 net/sunrpc/svc_xprt.c      | 51 +++++++++++++++++++-------------------
 4 files changed, 42 insertions(+), 40 deletions(-)
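
The locking asymmetry described above boils down to the following pattern
(an illustrative sketch distilled from the hunks below, not an extra hunk):

	/* enqueueing an idle thread needs no lock: llist_add() is safe
	 * against concurrent adders and a single remover
	 */
	llist_add(&rqstp->rq_idle, &pool->sp_idle_threads);

	/* removal must still be serialised, because concurrent
	 * llist_del_first() callers can race with each other, so the
	 * pool lock stays around the dequeue for now
	 */
	spin_lock_bh(&pool->sp_lock);
	ln = llist_del_first(&pool->sp_idle_threads);
	spin_unlock_bh(&pool->sp_lock);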

Comments

Chuck Lever Aug. 18, 2023, 3:53 p.m. UTC | #1
On Fri, Aug 18, 2023 at 11:45:06AM +1000, NeilBrown wrote:
> With an llist we don't need to take a lock to add a thread to the list,
> though we still need a lock to remove it.  That will go in the next
> patch.
> 
> Unlike double-linked lists, a thread cannot reliably remove itself from
> the list.  Only the first thread can be removed, and that can change
> asynchronously.  So some care is needed.
> 
> We already check if there is pending work to do, so we are unlikely to
> add ourselves to the idle list and then want to remove ourselves again.
> 
> If we DO find something needs to be done after adding ourselves to the
> list, we simply wake up the first thread on the list.  If that was us,
> we successfully removed ourselves and can continue.  If it was some
> other thread, they will do the work that needs to be done.  We can
> safely sleep until woken.
> 
> We also remove the test on freezing() from rqst_should_sleep().  Instead
> we set TASK_FREEZABLE before scheduling.  This makes it safe to
> schedule() when a freeze is pending.  As we now loop waiting to be
> removed from the idle queue, this is a cleaner way to handle freezing.
> 
> task_state_index() incorrectly identifies a task with
>    TASK_IDLE | TASK_FREEZABLE
> as 'D'.  So fix __task_state_index to ignore extra flags on TASK_IDLE
> just as it ignores extra flags on other states.

Let's split the task_state_index() change into a separate patch
that can be sent to the scheduler maintainers for their Review/Ack.


> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  include/linux/sched.h      |  4 +--
>  include/linux/sunrpc/svc.h | 14 ++++++-----
>  net/sunrpc/svc.c           | 13 +++++-----
>  net/sunrpc/svc_xprt.c      | 51 +++++++++++++++++++-------------------
>  4 files changed, 42 insertions(+), 40 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 609bde814cb0..a5f3badcb629 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1666,7 +1666,7 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
>  
>  	BUILD_BUG_ON_NOT_POWER_OF_2(TASK_REPORT_MAX);
>  
> -	if (tsk_state == TASK_IDLE)
> +	if ((tsk_state & TASK_IDLE) == TASK_IDLE)
>  		state = TASK_REPORT_IDLE;
>  
>  	/*
> @@ -1674,7 +1674,7 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
>  	 * to userspace, we can make this appear as if the task has gone through
>  	 * a regular rt_mutex_lock() call.
>  	 */
> -	if (tsk_state == TASK_RTLOCK_WAIT)
> +	if (tsk_state & TASK_RTLOCK_WAIT)
>  		state = TASK_UNINTERRUPTIBLE;
>  
>  	return fls(state);
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 22b3018ebf62..5216f95411e3 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -37,7 +37,7 @@ struct svc_pool {
>  	struct list_head	sp_sockets;	/* pending sockets */
>  	unsigned int		sp_nrthreads;	/* # of threads in pool */
>  	struct list_head	sp_all_threads;	/* all server threads */
> -	struct list_head	sp_idle_threads; /* idle server threads */
> +	struct llist_head	sp_idle_threads; /* idle server threads */
>  
>  	/* statistics on pool operation */
>  	struct percpu_counter	sp_messages_arrived;
> @@ -186,7 +186,7 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>   */
>  struct svc_rqst {
>  	struct list_head	rq_all;		/* all threads list */
> -	struct list_head	rq_idle;	/* On the idle list */
> +	struct llist_node	rq_idle;	/* On the idle list */
>  	struct rcu_head		rq_rcu_head;	/* for RCU deferred kfree */
>  	struct svc_xprt *	rq_xprt;	/* transport ptr */
>  
> @@ -270,22 +270,24 @@ enum {
>   * svc_thread_set_busy - mark a thread as busy
>   * @rqstp: the thread which is now busy
>   *
> - * If rq_idle is "empty", the thread must be busy.
> + * By convention a thread is busy if rq_idle.next points to rq_idle.
> + * This ensures it is not on the idle list.
>   */
>  static inline void svc_thread_set_busy(struct svc_rqst *rqstp)
>  {
> -	INIT_LIST_HEAD(&rqstp->rq_idle);
> +	rqstp->rq_idle.next = &rqstp->rq_idle;
>  }

I don't understand the comment "This ensures it is not on the idle
list." svc_thread_set_busy() is called in two places: The first
when an svc_rqst is created, and once directly after an
llist_del_first() has been done. @rqstp is already not on the
idle list in either case.

What really needs an explanation here is that there's no
existing utility to check whether an llist_node is on a list or
not.


>  /**
>   * svc_thread_busy - check if a thread is busy
>   * @rqstp: the thread which might be busy
>   *
> - * If rq_idle is "empty", the thread must be busy.
> + * By convention a thread is busy if rq_idle.next points to rq_idle.
> + * This ensures it is not on the idle list.

This function doesn't modify the thread, so it can't ensure it
is not on the idle list.


>   */
>  static inline bool svc_thread_busy(struct svc_rqst *rqstp)

const struct svc_rqst *rqstp

>  {
> -	return list_empty(&rqstp->rq_idle);
> +	return rqstp->rq_idle.next == &rqstp->rq_idle;
>  }
>  
>  #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 051f08c48418..addbf28ea50a 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -510,7 +510,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
>  		pool->sp_id = i;
>  		INIT_LIST_HEAD(&pool->sp_sockets);
>  		INIT_LIST_HEAD(&pool->sp_all_threads);
> -		INIT_LIST_HEAD(&pool->sp_idle_threads);
> +		init_llist_head(&pool->sp_idle_threads);
>  		spin_lock_init(&pool->sp_lock);
>  
>  		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> @@ -701,15 +701,16 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>  void svc_pool_wake_idle_thread(struct svc_pool *pool)
>  {
>  	struct svc_rqst	*rqstp;
> +	struct llist_node *ln;
>  
>  	rcu_read_lock();
>  	spin_lock_bh(&pool->sp_lock);
> -	rqstp = list_first_entry_or_null(&pool->sp_idle_threads,
> -					 struct svc_rqst, rq_idle);
> -	if (rqstp)
> -		list_del_init(&rqstp->rq_idle);
> +	ln = llist_del_first(&pool->sp_idle_threads);
>  	spin_unlock_bh(&pool->sp_lock);
> -	if (rqstp) {
> +	if (ln) {
> +		rqstp = llist_entry(ln, struct svc_rqst, rq_idle);
> +		svc_thread_set_busy(rqstp);
> +
>  		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
>  		wake_up_process(rqstp->rq_task);
>  		rcu_read_unlock();
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index fa0d854a5596..81327001e074 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -715,10 +715,6 @@ rqst_should_sleep(struct svc_rqst *rqstp)
>  	if (svc_thread_should_stop(rqstp))
>  		return false;
>  
> -	/* are we freezing? */
> -	if (freezing(current))
> -		return false;
> -
>  #if defined(CONFIG_SUNRPC_BACKCHANNEL)
>  	if (svc_is_backchannel(rqstp)) {
>  		if (!list_empty(&rqstp->rq_server->sv_cb_list))
> @@ -734,30 +730,32 @@ static void svc_rqst_wait_for_work(struct svc_rqst *rqstp)
>  	struct svc_pool *pool = rqstp->rq_pool;
>  
>  	if (rqst_should_sleep(rqstp)) {
> -		set_current_state(TASK_IDLE);
> -		spin_lock_bh(&pool->sp_lock);
> -		list_add(&rqstp->rq_idle, &pool->sp_idle_threads);
> -		spin_unlock_bh(&pool->sp_lock);
> +		set_current_state(TASK_IDLE | TASK_FREEZABLE);

I'm trying to learn about the semantics of IDLE|FREEZABLE, but there
isn't another instance of this flag combination in the kernel.


> +		llist_add(&rqstp->rq_idle, &pool->sp_idle_threads);
> +
> +		if (unlikely(!rqst_should_sleep(rqstp)))
> +			/* maybe there were no idle threads when some work
> +			 * became ready and so nothing was woken.  We've just
> +			 * become idle so someone can do the work - maybe us.
> +			 * But we cannot reliably remove ourselves from the
> +			 * idle list - we can only remove the first task which
> +			 * might be us, and might not.
> +			 * So remove and wake it, then schedule().  If it was
> +			 * us, we won't sleep.  If it is some other thread, they
> +			 * will do the work.
> +			 */

Large block comments suggest that this code doesn't document itself
very well.

At least, some of the textual redundancy can be removed, though this
comment and the next are removed or replaced in later patches. I'll
leave it up to you to find an approach that is a bit cleaner.


> +			svc_pool_wake_idle_thread(pool);
>  
> -		/* Need to check should_sleep() again after
> -		 * setting task state in case a wakeup happened
> -		 * between testing and setting.
> +		/* We mustn't continue while on the idle list, and we
> +		 * cannot remove ourselves reliably.  The only "work"
> +		 * we can do while on the idle list is to freeze.
> +		 * So loop until someone removes us
>  		 */

Here and above, the use of "we" obscures the explanation. What is
"we" in this context? I think it might be "this thread" or @rqstp,
but I can't be certain. For instance:

		/* Since a thread cannot remove itself from an llist,
		 * schedule until someone else removes @rqstp from
		 * the idle list.
		 */


> -		if (rqst_should_sleep(rqstp)) {
> +		while (!svc_thread_busy(rqstp)) {

I need to convince myself that this while() can't possibly result in
an infinite loop.


>  			schedule();
> -		} else {
> -			__set_current_state(TASK_RUNNING);
> -			cond_resched();
> -		}
> -
> -		/* We *must* be removed from the list before we can continue.
> -		 * If we were woken, this is already done
> -		 */
> -		if (!svc_thread_busy(rqstp)) {
> -			spin_lock_bh(&pool->sp_lock);
> -			list_del_init(&rqstp->rq_idle);
> -			spin_unlock_bh(&pool->sp_lock);
> +			set_current_state(TASK_IDLE | TASK_FREEZABLE);
>  		}
> +		__set_current_state(TASK_RUNNING);
> 
>  	} else {
>  		cond_resched();
>  	}

There's a try_to_freeze() call just after this hunk. Is there a
reason to invoke cond_resched(); before freezing?


> @@ -870,9 +868,10 @@ void svc_recv(struct svc_rqst *rqstp)
>  		struct svc_xprt *xprt = rqstp->rq_xprt;
>  
>  		/* Normally we will wait up to 5 seconds for any required
> -		 * cache information to be provided.
> +		 * cache information to be provided.  When there are no
> +		 * idle threads, we reduce the wait time.
>  		 */
> -		if (!list_empty(&pool->sp_idle_threads))
> +		if (pool->sp_idle_threads.first)
>  			rqstp->rq_chandle.thread_wait = 5 * HZ;
>  		else
>  			rqstp->rq_chandle.thread_wait = 1 * HZ;
> -- 
> 2.40.1
>
NeilBrown Aug. 30, 2023, 1:52 a.m. UTC | #2
On Sat, 19 Aug 2023, Chuck Lever wrote:
> On Fri, Aug 18, 2023 at 11:45:06AM +1000, NeilBrown wrote:
> > With an llist we don't need to take a lock to add a thread to the list,
> > though we still need a lock to remove it.  That will go in the next
> > patch.
> > 
> > Unlike double-linked lists, a thread cannot reliably remove itself from
> > the list.  Only the first thread can be removed, and that can change
> > asynchronously.  So some care is needed.
> > 
> > We already check if there is pending work to do, so we are unlikely to
> > add ourselves to the idle list and then want to remove ourselves again.
> > 
> > If we DO find something needs to be done after adding ourselves to the
> > list, we simply wake up the first thread on the list.  If that was us,
> > we successfully removed ourselves and can continue.  If it was some
> > other thread, they will do the work that needs to be done.  We can
> > safely sleep until woken.
> > 
> > We also remove the test on freezing() from rqst_should_sleep().  Instead
> > we set TASK_FREEZABLE before scheduling.  This makes it safe to
> > schedule() when a freeze is pending.  As we now loop waiting to be
> > removed from the idle queue, this is a cleaner way to handle freezing.
> > 
> > task_state_index() incorrectly identifies a task with
> >    TASK_IDLE | TASK_FREEZABLE
> > as 'D'.  So fix __task_state_index to ignore extra flags on TASK_IDLE
> > just as it ignores extra flags on other states.
> 
> Let's split the task_state_index() change into a separate patch
> that can be sent to the scheduler maintainers for their Review/Ack.
> 

I've sent that to the appropriate people.  It doesn't really matter if it
lands before or after we change sunrpc to use TASK_FREEZABLE.  The only
problem is that nfsd threads look like "D" in a ps listing rather than
"I".  The actual behaviour of the threads isn't changed.

> 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  include/linux/sched.h      |  4 +--
> >  include/linux/sunrpc/svc.h | 14 ++++++-----
> >  net/sunrpc/svc.c           | 13 +++++-----
> >  net/sunrpc/svc_xprt.c      | 51 +++++++++++++++++++-------------------
> >  4 files changed, 42 insertions(+), 40 deletions(-)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 609bde814cb0..a5f3badcb629 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1666,7 +1666,7 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
> >  
> >  	BUILD_BUG_ON_NOT_POWER_OF_2(TASK_REPORT_MAX);
> >  
> > -	if (tsk_state == TASK_IDLE)
> > +	if ((tsk_state & TASK_IDLE) == TASK_IDLE)
> >  		state = TASK_REPORT_IDLE;
> >  
> >  	/*
> > @@ -1674,7 +1674,7 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
> >  	 * to userspace, we can make this appear as if the task has gone through
> >  	 * a regular rt_mutex_lock() call.
> >  	 */
> > -	if (tsk_state == TASK_RTLOCK_WAIT)
> > +	if (tsk_state & TASK_RTLOCK_WAIT)
> >  		state = TASK_UNINTERRUPTIBLE;
> >  
> >  	return fls(state);
> > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > index 22b3018ebf62..5216f95411e3 100644
> > --- a/include/linux/sunrpc/svc.h
> > +++ b/include/linux/sunrpc/svc.h
> > @@ -37,7 +37,7 @@ struct svc_pool {
> >  	struct list_head	sp_sockets;	/* pending sockets */
> >  	unsigned int		sp_nrthreads;	/* # of threads in pool */
> >  	struct list_head	sp_all_threads;	/* all server threads */
> > -	struct list_head	sp_idle_threads; /* idle server threads */
> > +	struct llist_head	sp_idle_threads; /* idle server threads */
> >  
> >  	/* statistics on pool operation */
> >  	struct percpu_counter	sp_messages_arrived;
> > @@ -186,7 +186,7 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> >   */
> >  struct svc_rqst {
> >  	struct list_head	rq_all;		/* all threads list */
> > -	struct list_head	rq_idle;	/* On the idle list */
> > +	struct llist_node	rq_idle;	/* On the idle list */
> >  	struct rcu_head		rq_rcu_head;	/* for RCU deferred kfree */
> >  	struct svc_xprt *	rq_xprt;	/* transport ptr */
> >  
> > @@ -270,22 +270,24 @@ enum {
> >   * svc_thread_set_busy - mark a thread as busy
> >   * @rqstp: the thread which is now busy
> >   *
> > - * If rq_idle is "empty", the thread must be busy.
> > + * By convention a thread is busy if rq_idle.next points to rq_idle.
> > + * This ensures it is not on the idle list.
> >   */
> >  static inline void svc_thread_set_busy(struct svc_rqst *rqstp)
> >  {
> > -	INIT_LIST_HEAD(&rqstp->rq_idle);
> > +	rqstp->rq_idle.next = &rqstp->rq_idle;
> >  }
> 
> I don't understand the comment "This ensures it is not on the idle
> list." svc_thread_set_busy() is called in two places: The first
> when an svc_rqst is created, and once directly after an
> llist_del_first() has been done. @rqstp is already not on the
> idle list in either case.

"ensures" is the wrong word.  Maybe I meant "assures" or even "records".
I changed it to
 * This will never be the case for threads on the idle list.

> 
> What really needs an explanation here is that there's no
> existing utility to check whether an llist_node is on a list or
> not.

I guess no-one found the need before...
Though I note that md/raid5.c has a bit flag STRIPE_ON_RELEASE_LIST
to check if the stripe is on an llist.  So it could use that
functionality.

I would happily include a patch to add this to llist.h, except that by
the end of the series we don't need it any more...
Maybe I should do it anyway?
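
If it did go into llist.h it might look something like this (a sketch of
the convention that svc_thread_set_busy()/svc_thread_busy() open-code; the
helper names are hypothetical here):

	/* Mark a node as not being on any llist by making it point to itself.
	 * Neither llist_add() nor llist_del_first() resets ->next, so the
	 * remover must call this after taking a node off the list.
	 */
	static inline void init_llist_node(struct llist_node *node)
	{
		node->next = node;
	}

	static inline bool llist_on_list(const struct llist_node *node)
	{
		return node->next != node;
	}
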
> 
> 
> >  /**
> >   * svc_thread_busy - check if a thread is busy
> >   * @rqstp: the thread which might be busy
> >   *
> > - * If rq_idle is "empty", the thread must be busy.
> > + * By convention a thread is busy if rq_idle.next points to rq_idle.
> > + * This ensures it is not on the idle list.
> 
> This function doesn't modify the thread, so it can't ensure it
> is not on the idle list.
I changed that to

 * This will never be the case for threads on the idle list.

too.

> 
> 
> >   */
> >  static inline bool svc_thread_busy(struct svc_rqst *rqstp)
> 
> const struct svc_rqst *rqstp

ack.

> 
> >  {
> > -	return list_empty(&rqstp->rq_idle);
> > +	return rqstp->rq_idle.next == &rqstp->rq_idle;
> >  }
> >  
> >  #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
> > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > index 051f08c48418..addbf28ea50a 100644
> > --- a/net/sunrpc/svc.c
> > +++ b/net/sunrpc/svc.c
> > @@ -510,7 +510,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> >  		pool->sp_id = i;
> >  		INIT_LIST_HEAD(&pool->sp_sockets);
> >  		INIT_LIST_HEAD(&pool->sp_all_threads);
> > -		INIT_LIST_HEAD(&pool->sp_idle_threads);
> > +		init_llist_head(&pool->sp_idle_threads);
> >  		spin_lock_init(&pool->sp_lock);
> >  
> >  		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> > @@ -701,15 +701,16 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> >  void svc_pool_wake_idle_thread(struct svc_pool *pool)
> >  {
> >  	struct svc_rqst	*rqstp;
> > +	struct llist_node *ln;
> >  
> >  	rcu_read_lock();
> >  	spin_lock_bh(&pool->sp_lock);
> > -	rqstp = list_first_entry_or_null(&pool->sp_idle_threads,
> > -					 struct svc_rqst, rq_idle);
> > -	if (rqstp)
> > -		list_del_init(&rqstp->rq_idle);
> > +	ln = llist_del_first(&pool->sp_idle_threads);
> >  	spin_unlock_bh(&pool->sp_lock);
> > -	if (rqstp) {
> > +	if (ln) {
> > +		rqstp = llist_entry(ln, struct svc_rqst, rq_idle);
> > +		svc_thread_set_busy(rqstp);
> > +
> >  		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> >  		wake_up_process(rqstp->rq_task);
> >  		rcu_read_unlock();
> > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > index fa0d854a5596..81327001e074 100644
> > --- a/net/sunrpc/svc_xprt.c
> > +++ b/net/sunrpc/svc_xprt.c
> > @@ -715,10 +715,6 @@ rqst_should_sleep(struct svc_rqst *rqstp)
> >  	if (svc_thread_should_stop(rqstp))
> >  		return false;
> >  
> > -	/* are we freezing? */
> > -	if (freezing(current))
> > -		return false;
> > -
> >  #if defined(CONFIG_SUNRPC_BACKCHANNEL)
> >  	if (svc_is_backchannel(rqstp)) {
> >  		if (!list_empty(&rqstp->rq_server->sv_cb_list))
> > @@ -734,30 +730,32 @@ static void svc_rqst_wait_for_work(struct svc_rqst *rqstp)
> >  	struct svc_pool *pool = rqstp->rq_pool;
> >  
> >  	if (rqst_should_sleep(rqstp)) {
> > -		set_current_state(TASK_IDLE);
> > -		spin_lock_bh(&pool->sp_lock);
> > -		list_add(&rqstp->rq_idle, &pool->sp_idle_threads);
> > -		spin_unlock_bh(&pool->sp_lock);
> > +		set_current_state(TASK_IDLE | TASK_FREEZABLE);
> 
> I'm trying to learn about the semantics of IDLE|FREEZABLE, but there
> isn't another instance of this flag combination in the kernel.

No - which is why no one else noticed that the state reported by ps was
wrong.
Quite a few places use INTERRUPTIBLE|FREEZABLE which is much the same
thing except sunrpc threads don't want interrupts.

My interpretation of FREEZABLE is that it avoids the need to check for
freezing, or to freeze.  A higher-level interface.
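
Roughly, the two patterns compare like this (a sketch, not the exact
sunrpc hunks):

	/* old: the thread has to notice a pending freeze itself */
	if (freezing(current))
		return false;			/* don't sleep ... */
	set_current_state(TASK_IDLE);
	schedule();
	try_to_freeze();			/* ... freeze here instead */

	/* new: the freezer can freeze the task while it sleeps */
	set_current_state(TASK_IDLE | TASK_FREEZABLE);
	schedule();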

> 
> 
> > +		llist_add(&rqstp->rq_idle, &pool->sp_idle_threads);
> > +
> > +		if (unlikely(!rqst_should_sleep(rqstp)))
> > +			/* maybe there were no idle threads when some work
> > +			 * became ready and so nothing was woken.  We've just
> > +			 * become idle so someone can do the work - maybe us.
> > +			 * But we cannot reliably remove ourselves from the
> > +			 * idle list - we can only remove the first task which
> > +			 * might be us, and might not.
> > +			 * So remove and wake it, then schedule().  If it was
> > +			 * us, we won't sleep.  If it is some other thread, they
> > +			 * will do the work.
> > +			 */
> 
> Large block comments suggest that this code doesn't document itself
> very well.

Self documenting code is best, but when the logic is subtle and
unfamiliar, a bit of text can help.
Of course the logic is quite familiar to me now and doesn't seem so
subtle, so I wonder if any comment is needed.
> 
> At least, some of the textual redundancy can be removed, though this
> comment and the next are removed or replaced in later patches. I'll
> leave it up to you to find an approach that is a bit cleaner.

Is now:
	/* Work just became available.  This thread cannot simply
	 * choose not to sleep as it *must* wait until removed.
	 * So wake the first waiter - whether it is this
	 * thread or some other, it will get the work done.
	 */


> 
> 
> > +			svc_pool_wake_idle_thread(pool);
> >  
> > -		/* Need to check should_sleep() again after
> > -		 * setting task state in case a wakeup happened
> > -		 * between testing and setting.
> > +		/* We mustn't continue while on the idle list, and we
> > +		 * cannot remove ourselves reliably.  The only "work"
> > +		 * we can do while on the idle list is to freeze.
> > +		 * So loop until someone removes us
> >  		 */
> 
> Here and above, the use of "we" obscures the explanation. What is
> "we" in this context? I think it might be "this thread" or @rqstp,
> but I can't be certain. For instance:
> 
> 		/* Since a thread cannot remove itself from an llist,
> 		 * schedule until someone else removes @rqstp from
> 		 * the idle list.
> 		 */

Looks good - thanks.

> 
> 
> > -		if (rqst_should_sleep(rqstp)) {
> > +		while (!svc_thread_busy(rqstp)) {
> 
> I need to convince myself that this while() can't possibly result in
> an infinite loop.
> 

It cannot be a busy-loop because we set the task state and call schedule().
It could loop for a long time if there is no work to do.
But at service shutdown, everything will be removed from the idle list
and woken, so the loop will exit then.
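
Spelled out, the handshake condensed from the two hunks is:

	/* waker (svc_pool_wake_idle_thread): dequeue, mark busy, then wake */
	ln = llist_del_first(&pool->sp_idle_threads);
	if (ln) {
		rqstp = llist_entry(ln, struct svc_rqst, rq_idle);
		svc_thread_set_busy(rqstp);	/* rq_idle.next = &rq_idle */
		wake_up_process(rqstp->rq_task);
	}

	/* sleeper (svc_rqst_wait_for_work): only proceed once marked busy */
	while (!svc_thread_busy(rqstp)) {
		schedule();
		set_current_state(TASK_IDLE | TASK_FREEZABLE);
	}
	__set_current_state(TASK_RUNNING);

Every exit from the loop is preceded by a wake_up_process() from whichever
thread did the llist_del_first(), so the only way to stay in the loop is to
stay on the idle list.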

> 
> >  			schedule();
> > -		} else {
> > -			__set_current_state(TASK_RUNNING);
> > -			cond_resched();
> > -		}
> > -
> > -		/* We *must* be removed from the list before we can continue.
> > -		 * If we were woken, this is already done
> > -		 */
> > -		if (!svc_thread_busy(rqstp)) {
> > -			spin_lock_bh(&pool->sp_lock);
> > -			list_del_init(&rqstp->rq_idle);
> > -			spin_unlock_bh(&pool->sp_lock);
> > +			set_current_state(TASK_IDLE | TASK_FREEZABLE);
> >  		}
> > +		__set_current_state(TASK_RUNNING);
> > 
> >  	} else {
> >  		cond_resched();
> >  	}
> 
> There's a try_to_freeze() call just after this hunk. Is there a
> reason to invoke cond_resched(); before freezing?

try_to_freeze() can compile to {return false;}.  Even when not, it
usually does no more than test a global variable.  So the presence of
try_to_freeze() has no bearing on the use of cond_resched().
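
For reference, the fast path is roughly this shape (paraphrased from
memory, not a verbatim copy of include/linux/freezer.h):

	/* CONFIG_FREEZER=n: a stub */
	static inline bool try_to_freeze(void) { return false; }

	/* CONFIG_FREEZER=y: the common case is a single cheap test */
	static inline bool try_to_freeze(void)
	{
		might_sleep();
		if (likely(!freezing(current)))
			return false;
		return __refrigerator(false);
	}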

> 
> 
> > @@ -870,9 +868,10 @@ void svc_recv(struct svc_rqst *rqstp)
> >  		struct svc_xprt *xprt = rqstp->rq_xprt;
> >  
> >  		/* Normally we will wait up to 5 seconds for any required
> > -		 * cache information to be provided.
> > +		 * cache information to be provided.  When there are no
> > +		 * idle threads, we reduce the wait time.
> >  		 */
> > -		if (!list_empty(&pool->sp_idle_threads))
> > +		if (pool->sp_idle_threads.first)
> >  			rqstp->rq_chandle.thread_wait = 5 * HZ;
> >  		else
> >  			rqstp->rq_chandle.thread_wait = 1 * HZ;
> > -- 
> > 2.40.1
> > 

I'll send a new batch today, I hope.  I decided to move lwq into lib/
and make all the changes to llist in separate patches.

Thanks,
NeilBrown

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 609bde814cb0..a5f3badcb629 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1666,7 +1666,7 @@  static inline unsigned int __task_state_index(unsigned int tsk_state,
 
 	BUILD_BUG_ON_NOT_POWER_OF_2(TASK_REPORT_MAX);
 
-	if (tsk_state == TASK_IDLE)
+	if ((tsk_state & TASK_IDLE) == TASK_IDLE)
 		state = TASK_REPORT_IDLE;
 
 	/*
@@ -1674,7 +1674,7 @@  static inline unsigned int __task_state_index(unsigned int tsk_state,
 	 * to userspace, we can make this appear as if the task has gone through
 	 * a regular rt_mutex_lock() call.
 	 */
-	if (tsk_state == TASK_RTLOCK_WAIT)
+	if (tsk_state & TASK_RTLOCK_WAIT)
 		state = TASK_UNINTERRUPTIBLE;
 
 	return fls(state);
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 22b3018ebf62..5216f95411e3 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -37,7 +37,7 @@  struct svc_pool {
 	struct list_head	sp_sockets;	/* pending sockets */
 	unsigned int		sp_nrthreads;	/* # of threads in pool */
 	struct list_head	sp_all_threads;	/* all server threads */
-	struct list_head	sp_idle_threads; /* idle server threads */
+	struct llist_head	sp_idle_threads; /* idle server threads */
 
 	/* statistics on pool operation */
 	struct percpu_counter	sp_messages_arrived;
@@ -186,7 +186,7 @@  extern u32 svc_max_payload(const struct svc_rqst *rqstp);
  */
 struct svc_rqst {
 	struct list_head	rq_all;		/* all threads list */
-	struct list_head	rq_idle;	/* On the idle list */
+	struct llist_node	rq_idle;	/* On the idle list */
 	struct rcu_head		rq_rcu_head;	/* for RCU deferred kfree */
 	struct svc_xprt *	rq_xprt;	/* transport ptr */
 
@@ -270,22 +270,24 @@  enum {
  * svc_thread_set_busy - mark a thread as busy
  * @rqstp: the thread which is now busy
  *
- * If rq_idle is "empty", the thread must be busy.
+ * By convention a thread is busy if rq_idle.next points to rq_idle.
+ * This ensures it is not on the idle list.
  */
 static inline void svc_thread_set_busy(struct svc_rqst *rqstp)
 {
-	INIT_LIST_HEAD(&rqstp->rq_idle);
+	rqstp->rq_idle.next = &rqstp->rq_idle;
 }
 
 /**
  * svc_thread_busy - check if a thread is busy
  * @rqstp: the thread which might be busy
  *
- * If rq_idle is "empty", the thread must be busy.
+ * By convention a thread is busy if rq_idle.next points to rq_idle.
+ * This ensures it is not on the idle list.
  */
 static inline bool svc_thread_busy(struct svc_rqst *rqstp)
 {
-	return list_empty(&rqstp->rq_idle);
+	return rqstp->rq_idle.next == &rqstp->rq_idle;
 }
 
 #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 051f08c48418..addbf28ea50a 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -510,7 +510,7 @@  __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
 		pool->sp_id = i;
 		INIT_LIST_HEAD(&pool->sp_sockets);
 		INIT_LIST_HEAD(&pool->sp_all_threads);
-		INIT_LIST_HEAD(&pool->sp_idle_threads);
+		init_llist_head(&pool->sp_idle_threads);
 		spin_lock_init(&pool->sp_lock);
 
 		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
@@ -701,15 +701,16 @@  svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
 void svc_pool_wake_idle_thread(struct svc_pool *pool)
 {
 	struct svc_rqst	*rqstp;
+	struct llist_node *ln;
 
 	rcu_read_lock();
 	spin_lock_bh(&pool->sp_lock);
-	rqstp = list_first_entry_or_null(&pool->sp_idle_threads,
-					 struct svc_rqst, rq_idle);
-	if (rqstp)
-		list_del_init(&rqstp->rq_idle);
+	ln = llist_del_first(&pool->sp_idle_threads);
 	spin_unlock_bh(&pool->sp_lock);
-	if (rqstp) {
+	if (ln) {
+		rqstp = llist_entry(ln, struct svc_rqst, rq_idle);
+		svc_thread_set_busy(rqstp);
+
 		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
 		wake_up_process(rqstp->rq_task);
 		rcu_read_unlock();
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index fa0d854a5596..81327001e074 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -715,10 +715,6 @@  rqst_should_sleep(struct svc_rqst *rqstp)
 	if (svc_thread_should_stop(rqstp))
 		return false;
 
-	/* are we freezing? */
-	if (freezing(current))
-		return false;
-
 #if defined(CONFIG_SUNRPC_BACKCHANNEL)
 	if (svc_is_backchannel(rqstp)) {
 		if (!list_empty(&rqstp->rq_server->sv_cb_list))
@@ -734,30 +730,32 @@  static void svc_rqst_wait_for_work(struct svc_rqst *rqstp)
 	struct svc_pool *pool = rqstp->rq_pool;
 
 	if (rqst_should_sleep(rqstp)) {
-		set_current_state(TASK_IDLE);
-		spin_lock_bh(&pool->sp_lock);
-		list_add(&rqstp->rq_idle, &pool->sp_idle_threads);
-		spin_unlock_bh(&pool->sp_lock);
+		set_current_state(TASK_IDLE | TASK_FREEZABLE);
+		llist_add(&rqstp->rq_idle, &pool->sp_idle_threads);
+
+		if (unlikely(!rqst_should_sleep(rqstp)))
+			/* maybe there were no idle threads when some work
+			 * became ready and so nothing was woken.  We've just
+			 * become idle so someone can do the work - maybe us.
+			 * But we cannot reliably remove ourselves from the
+			 * idle list - we can only remove the first task which
+			 * might be us, and might not.
+			 * So remove and wake it, then schedule().  If it was
+			 * us, we won't sleep.  If it is some other thread, they
+			 * will do the work.
+			 */
+			svc_pool_wake_idle_thread(pool);
 
-		/* Need to check should_sleep() again after
-		 * setting task state in case a wakeup happened
-		 * between testing and setting.
+		/* We mustn't continue while on the idle list, and we
+		 * cannot remove ourselves reliably.  The only "work"
+		 * we can do while on the idle list is to freeze.
+		 * So loop until someone removes us
 		 */
-		if (rqst_should_sleep(rqstp)) {
+		while (!svc_thread_busy(rqstp)) {
 			schedule();
-		} else {
-			__set_current_state(TASK_RUNNING);
-			cond_resched();
-		}
-
-		/* We *must* be removed from the list before we can continue.
-		 * If we were woken, this is already done
-		 */
-		if (!svc_thread_busy(rqstp)) {
-			spin_lock_bh(&pool->sp_lock);
-			list_del_init(&rqstp->rq_idle);
-			spin_unlock_bh(&pool->sp_lock);
+			set_current_state(TASK_IDLE | TASK_FREEZABLE);
 		}
+		__set_current_state(TASK_RUNNING);
 	} else {
 		cond_resched();
 	}
@@ -870,9 +868,10 @@  void svc_recv(struct svc_rqst *rqstp)
 		struct svc_xprt *xprt = rqstp->rq_xprt;
 
 		/* Normally we will wait up to 5 seconds for any required
-		 * cache information to be provided.
+		 * cache information to be provided.  When there are no
+		 * idle threads, we reduce the wait time.
 		 */
-		if (!list_empty(&pool->sp_idle_threads))
+		if (pool->sp_idle_threads.first)
 			rqstp->rq_chandle.thread_wait = 5 * HZ;
 		else
 			rqstp->rq_chandle.thread_wait = 1 * HZ;