diff mbox series

[v3,4/7] rcu: Improve handling of synchronize_rcu() users

Message ID 20231128080033.288050-5-urezki@gmail.com (mailing list archive)
State New, archived
Headers show
Series Reduce synchronize_rcu() latency(V3) | expand

Commit Message

Uladzislau Rezki Nov. 28, 2023, 8 a.m. UTC
From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>

Currently, processing of the next batch of rcu_synchronize nodes
for the new grace period, requires doing a llist reversal operation
to find the tail element of the list. This can be a very costly
operation (high number of cache misses) for a long list.

To address this, this patch introduces a "dummy-wait-node" entity.
At every grace period init, a new wait node is added to the llist.
This wait node is used as wait tail for this new grace period.

This allows lockless additions of new rcu_synchronize nodes in the
rcu_sr_normal_add_req(), while the cleanup work executes and does
the progress. The dummy nodes are removed on next round of cleanup
work execution.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
---
 kernel/rcu/tree.c | 270 +++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 233 insertions(+), 37 deletions(-)

Comments

Paul E. McKenney Dec. 20, 2023, 1:37 a.m. UTC | #1
On Tue, Nov 28, 2023 at 09:00:30AM +0100, Uladzislau Rezki (Sony) wrote:
> From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> 
> Currently, processing of the next batch of rcu_synchronize nodes
> for the new grace period, requires doing a llist reversal operation
> to find the tail element of the list. This can be a very costly
> operation (high number of cache misses) for a long list.
> 
> To address this, this patch introduces a "dummy-wait-node" entity.
> At every grace period init, a new wait node is added to the llist.
> This wait node is used as wait tail for this new grace period.
> 
> This allows lockless additions of new rcu_synchronize nodes in the
> rcu_sr_normal_add_req(), while the cleanup work executes and does
> the progress. The dummy nodes are removed on next round of cleanup
> work execution.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>

This says that Uladzislau created the patch and that Neeraj
acted as maintainer.  I am guessing that you both worked on it,
in which case is should have the Co-developed-by tags as shown in
Documentation/process/submitting-patches.rst.  Could you please update
these to reflect the actual origin?

One question below toward the end.  There are probably others that I
should be asking, but I have to start somewhere.  ;-)

						Thanx, Paul

> ---
>  kernel/rcu/tree.c | 270 +++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 233 insertions(+), 37 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 975621ef40e3..d7b48996825f 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1384,25 +1384,173 @@ static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
>  		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  }
>  
> +#define SR_NORMAL_GP_WAIT_HEAD_MAX 5
> +
> +struct sr_wait_node {
> +	atomic_t inuse;
> +	struct llist_node node;
> +};
> +
>  /*
> - * There are three lists for handling synchronize_rcu() users.
> - * A first list corresponds to new coming users, second for users
> - * which wait for a grace period and third is for which a grace
> - * period is passed.
> + * There is a single llist, which is used for handling
> + * synchronize_rcu() users' enqueued rcu_synchronize nodes.
> + * Within this llist, there are two tail pointers:
> + *
> + * wait tail: Tracks the set of nodes, which need to
> + *            wait for the current GP to complete.
> + * done tail: Tracks the set of nodes, for which grace
> + *            period has elapsed. These nodes processing
> + *            will be done as part of the cleanup work
> + *            execution by a kworker.
> + *
> + * At every grace period init, a new wait node is added
> + * to the llist. This wait node is used as wait tail
> + * for this new grace period. Given that there are a fixed
> + * number of wait nodes, if all wait nodes are in use
> + * (which can happen when kworker callback processing
> + * is delayed) and additional grace period is requested.
> + * This means, a system is slow in processing callbacks.
> + *
> + * TODO: If a slow processing is detected, a first node
> + * in the llist should be used as a wait-tail for this
> + * grace period, therefore users which should wait due
> + * to a slow process are handled by _this_ grace period
> + * and not next.
> + *
> + * Below is an illustration of how the done and wait
> + * tail pointers move from one set of rcu_synchronize nodes
> + * to the other, as grace periods start and finish and
> + * nodes are processed by kworker.
> + *
> + *
> + * a. Initial llist callbacks list:
> + *
> + * +----------+           +--------+          +-------+
> + * |          |           |        |          |       |
> + * |   head   |---------> |   cb2  |--------->| cb1   |
> + * |          |           |        |          |       |
> + * +----------+           +--------+          +-------+
> + *
> + *
> + *
> + * b. New GP1 Start:
> + *
> + *                    WAIT TAIL
> + *                      |
> + *                      |
> + *                      v
> + * +----------+     +--------+      +--------+        +-------+
> + * |          |     |        |      |        |        |       |
> + * |   head   ------> wait   |------>   cb2  |------> |  cb1  |
> + * |          |     | head1  |      |        |        |       |
> + * +----------+     +--------+      +--------+        +-------+
> + *
> + *
> + *
> + * c. GP completion:
> + *
> + * WAIT_TAIL == DONE_TAIL
> + *
> + *                   DONE TAIL
> + *                     |
> + *                     |
> + *                     v
> + * +----------+     +--------+      +--------+        +-------+
> + * |          |     |        |      |        |        |       |
> + * |   head   ------> wait   |------>   cb2  |------> |  cb1  |
> + * |          |     | head1  |      |        |        |       |
> + * +----------+     +--------+      +--------+        +-------+
> + *
> + *
> + *
> + * d. New callbacks and GP2 start:
> + *
> + *                    WAIT TAIL                          DONE TAIL
> + *                      |                                 |
> + *                      |                                 |
> + *                      v                                 v
> + * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
> + * |          |     |      |    |      |    |      |    |     |    |     |    |     |
> + * |   head   ------> wait |--->|  cb4 |--->| cb3  |--->|wait |--->| cb2 |--->| cb1 |
> + * |          |     | head2|    |      |    |      |    |head1|    |     |    |     |
> + * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
> + *
> + *
> + *
> + * e. GP2 completion:
> + *
> + * WAIT_TAIL == DONE_TAIL
> + *                   DONE TAIL
> + *                      |
> + *                      |
> + *                      v
> + * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
> + * |          |     |      |    |      |    |      |    |     |    |     |    |     |
> + * |   head   ------> wait |--->|  cb4 |--->| cb3  |--->|wait |--->| cb2 |--->| cb1 |
> + * |          |     | head2|    |      |    |      |    |head1|    |     |    |     |
> + * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
> + *
> + *
> + * While the llist state transitions from d to e, a kworker
> + * can start executing rcu_sr_normal_gp_cleanup_work() and
> + * can observe either the old done tail (@c) or the new
> + * done tail (@e). So, done tail updates and reads need
> + * to use the rel-acq semantics. If the concurrent kworker
> + * observes the old done tail, the newly queued work
> + * execution will process the updated done tail. If the
> + * concurrent kworker observes the new done tail, then
> + * the newly queued work will skip processing the done
> + * tail, as workqueue semantics guarantees that the new
> + * work is executed only after the previous one completes.
> + *
> + * f. kworker callbacks processing complete:
> + *
> + *
> + *                   DONE TAIL
> + *                     |
> + *                     |
> + *                     v
> + * +----------+     +--------+
> + * |          |     |        |
> + * |   head   ------> wait   |
> + * |          |     | head2  |
> + * +----------+     +--------+
> + *
>   */
>  static struct sr_normal_state {
>  	struct llist_head srs_next;	/* request a GP users. */
> -	struct llist_head srs_wait;	/* wait for GP users. */
> -	struct llist_head srs_done;	/* ready for GP users. */
> -
> -	/*
> -	 * In order to add a batch of nodes to already
> -	 * existing srs-done-list, a tail of srs-wait-list
> -	 * is maintained.
> -	 */
> -	struct llist_node *srs_wait_tail;
> +	struct llist_node *srs_wait_tail; /* wait for GP users. */
> +	struct llist_node *srs_done_tail; /* ready for GP users. */
> +	struct sr_wait_node srs_wait_nodes[SR_NORMAL_GP_WAIT_HEAD_MAX];
>  } sr;
>  
> +static bool rcu_sr_is_wait_head(struct llist_node *node)
> +{
> +	return &(sr.srs_wait_nodes)[0].node <= node &&
> +		node <= &(sr.srs_wait_nodes)[SR_NORMAL_GP_WAIT_HEAD_MAX - 1].node;
> +}
> +
> +static struct llist_node *rcu_sr_get_wait_head(void)
> +{
> +	struct sr_wait_node *sr_wn;
> +	int i;
> +
> +	for (i = 0; i < SR_NORMAL_GP_WAIT_HEAD_MAX; i++) {
> +		sr_wn = &(sr.srs_wait_nodes)[i];
> +
> +		if (!atomic_cmpxchg_acquire(&sr_wn->inuse, 0, 1))
> +			return &sr_wn->node;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void rcu_sr_put_wait_head(struct llist_node *node)
> +{
> +	struct sr_wait_node *sr_wn = container_of(node, struct sr_wait_node, node);
> +	atomic_set_release(&sr_wn->inuse, 0);
> +}
> +
>  /* Disabled by default. */
>  static int rcu_normal_wake_from_gp;
>  module_param(rcu_normal_wake_from_gp, int, 0644);
> @@ -1423,14 +1571,44 @@ static void rcu_sr_normal_complete(struct llist_node *node)
>  
>  static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
>  {
> -	struct llist_node *done, *rcu, *next;
> +	struct llist_node *done, *rcu, *next, *head;
>  
> -	done = llist_del_all(&sr.srs_done);
> +	/*
> +	 * This work execution can potentially execute
> +	 * while a new done tail is being updated by
> +	 * grace period kthread in rcu_sr_normal_gp_cleanup().
> +	 * So, read and updates of done tail need to
> +	 * follow acq-rel semantics.
> +	 *
> +	 * Given that wq semantics guarantees that a single work
> +	 * cannot execute concurrently by multiple kworkers,
> +	 * the done tail list manipulations are protected here.
> +	 */
> +	done = smp_load_acquire(&sr.srs_done_tail);
>  	if (!done)
>  		return;
>  
> -	llist_for_each_safe(rcu, next, done)
> -		rcu_sr_normal_complete(rcu);
> +	WARN_ON_ONCE(!rcu_sr_is_wait_head(done));
> +	head = done->next;
> +	done->next = NULL;
> +
> +	/*
> +	 * The dummy node, which is pointed to by the
> +	 * done tail which is acq-read above is not removed
> +	 * here.  This allows lockless additions of new
> +	 * rcu_synchronize nodes in rcu_sr_normal_add_req(),
> +	 * while the cleanup work executes. The dummy
> +	 * nodes is removed, in next round of cleanup
> +	 * work execution.
> +	 */
> +	llist_for_each_safe(rcu, next, head) {
> +		if (!rcu_sr_is_wait_head(rcu)) {
> +			rcu_sr_normal_complete(rcu);
> +			continue;
> +		}
> +
> +		rcu_sr_put_wait_head(rcu);
> +	}
>  }
>  static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
>  
> @@ -1439,43 +1617,56 @@ static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
>   */
>  static void rcu_sr_normal_gp_cleanup(void)
>  {
> -	struct llist_node *head, *tail;
> +	struct llist_node *wait_tail;
>  
> -	if (llist_empty(&sr.srs_wait))
> +	wait_tail = sr.srs_wait_tail;
> +	if (wait_tail == NULL)
>  		return;
>  
> -	tail = READ_ONCE(sr.srs_wait_tail);
> -	head = __llist_del_all(&sr.srs_wait);
> +	sr.srs_wait_tail = NULL;
> +	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
>  
> -	if (head) {
> -		/* Can be not empty. */
> -		llist_add_batch(head, tail, &sr.srs_done);
> +	// concurrent sr_normal_gp_cleanup work might observe this update.
> +	smp_store_release(&sr.srs_done_tail, wait_tail);
> +	ASSERT_EXCLUSIVE_WRITER(sr.srs_done_tail);
> +
> +	if (wait_tail)
>  		queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
> -	}
>  }
>  
>  /*
>   * Helper function for rcu_gp_init().
>   */
> -static void rcu_sr_normal_gp_init(void)
> +static bool rcu_sr_normal_gp_init(void)
>  {
> -	struct llist_node *head, *tail;
> +	struct llist_node *first;
> +	struct llist_node *wait_head;
> +	bool start_new_poll = false;
>  
> -	if (llist_empty(&sr.srs_next))
> -		return;
> +	first = READ_ONCE(sr.srs_next.first);
> +	if (!first || rcu_sr_is_wait_head(first))
> +		return start_new_poll;
> +
> +	wait_head = rcu_sr_get_wait_head();
> +	if (!wait_head) {
> +		// Kick another GP to retry.
> +		start_new_poll = true;
> +		return start_new_poll;
> +	}
>  
> -	tail = llist_del_all(&sr.srs_next);
> -	head = llist_reverse_order(tail);
> +	/* Inject a wait-dummy-node. */
> +	llist_add(wait_head, &sr.srs_next);
>  
>  	/*
> -	 * A waiting list of GP should be empty on this step,
> -	 * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> +	 * A waiting list of rcu_synchronize nodes should be empty on
> +	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
>  	 * rolls it over. If not, it is a BUG, warn a user.
>  	 */
> -	WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
> +	WARN_ON_ONCE(sr.srs_wait_tail != NULL);
> +	sr.srs_wait_tail = wait_head;
> +	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
>  
> -	WRITE_ONCE(sr.srs_wait_tail, tail);
> -	__llist_add_batch(head, tail, &sr.srs_wait);
> +	return start_new_poll;
>  }
>  
>  static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> @@ -1493,6 +1684,7 @@ static noinline_for_stack bool rcu_gp_init(void)
>  	unsigned long mask;
>  	struct rcu_data *rdp;
>  	struct rcu_node *rnp = rcu_get_root();
> +	bool start_new_poll;
>  
>  	WRITE_ONCE(rcu_state.gp_activity, jiffies);
>  	raw_spin_lock_irq_rcu_node(rnp);
> @@ -1517,11 +1709,15 @@ static noinline_for_stack bool rcu_gp_init(void)
>  	/* Record GP times before starting GP, hence rcu_seq_start(). */
>  	rcu_seq_start(&rcu_state.gp_seq);
>  	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> -	rcu_sr_normal_gp_init();
> +	start_new_poll = rcu_sr_normal_gp_init();
>  	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
>  	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
>  	raw_spin_unlock_irq_rcu_node(rnp);
>  
> +	// New poll request after rnp unlock
> +	if (start_new_poll)
> +		(void) start_poll_synchronize_rcu();

You lost me on this one.  Anything that got moved to the wait list
should be handled by the current grace period, right?  Or is the
problem that rcu_sr_normal_gp_init() is being invoked after the call
to rcu_seq_start()?  If that is the case, could it be moved ahead so
that we don't need the extra grace period?

Or am I missing something subtle here?

> +
>  	/*
>  	 * Apply per-leaf buffered online and offline operations to
>  	 * the rcu_node tree. Note that this new grace period need not
> -- 
> 2.39.2
>
Uladzislau Rezki Dec. 21, 2023, 10:52 a.m. UTC | #2
On Tue, Dec 19, 2023 at 05:37:56PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 28, 2023 at 09:00:30AM +0100, Uladzislau Rezki (Sony) wrote:
> > From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > 
> > Currently, processing of the next batch of rcu_synchronize nodes
> > for the new grace period, requires doing a llist reversal operation
> > to find the tail element of the list. This can be a very costly
> > operation (high number of cache misses) for a long list.
> > 
> > To address this, this patch introduces a "dummy-wait-node" entity.
> > At every grace period init, a new wait node is added to the llist.
> > This wait node is used as wait tail for this new grace period.
> > 
> > This allows lockless additions of new rcu_synchronize nodes in the
> > rcu_sr_normal_add_req(), while the cleanup work executes and does
> > the progress. The dummy nodes are removed on next round of cleanup
> > work execution.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> 
> This says that Uladzislau created the patch and that Neeraj
> acted as maintainer.  I am guessing that you both worked on it,
> in which case is should have the Co-developed-by tags as shown in
> Documentation/process/submitting-patches.rst.  Could you please update
> these to reflect the actual origin?
> 
Right. We both worked on it. Neeraj is an author whereas i should mark
myself as a Co-developed-by. This is a correct way. Thank you for
pointing on it!

>
> One question below toward the end.  There are probably others that I
> should be asking, but I have to start somewhere.  ;-)
> 
Good :)

> >  
> >  /*
> >   * Helper function for rcu_gp_init().
> >   */
> > -static void rcu_sr_normal_gp_init(void)
> > +static bool rcu_sr_normal_gp_init(void)
> >  {
> > -	struct llist_node *head, *tail;
> > +	struct llist_node *first;
> > +	struct llist_node *wait_head;
> > +	bool start_new_poll = false;
> >  
> > -	if (llist_empty(&sr.srs_next))
> > -		return;
> > +	first = READ_ONCE(sr.srs_next.first);
> > +	if (!first || rcu_sr_is_wait_head(first))
> > +		return start_new_poll;
> > +
> > +	wait_head = rcu_sr_get_wait_head();
> > +	if (!wait_head) {
> > +		// Kick another GP to retry.
> > +		start_new_poll = true;
> > +		return start_new_poll;
> > +	}
> >  
> > -	tail = llist_del_all(&sr.srs_next);
> > -	head = llist_reverse_order(tail);
> > +	/* Inject a wait-dummy-node. */
> > +	llist_add(wait_head, &sr.srs_next);
> >  
> >  	/*
> > -	 * A waiting list of GP should be empty on this step,
> > -	 * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > +	 * A waiting list of rcu_synchronize nodes should be empty on
> > +	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> >  	 * rolls it over. If not, it is a BUG, warn a user.
> >  	 */
> > -	WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
> > +	WARN_ON_ONCE(sr.srs_wait_tail != NULL);
> > +	sr.srs_wait_tail = wait_head;
> > +	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
> >  
> > -	WRITE_ONCE(sr.srs_wait_tail, tail);
> > -	__llist_add_batch(head, tail, &sr.srs_wait);
> > +	return start_new_poll;
> >  }
> >  
> >  static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > @@ -1493,6 +1684,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> >  	unsigned long mask;
> >  	struct rcu_data *rdp;
> >  	struct rcu_node *rnp = rcu_get_root();
> > +	bool start_new_poll;
> >  
> >  	WRITE_ONCE(rcu_state.gp_activity, jiffies);
> >  	raw_spin_lock_irq_rcu_node(rnp);
> > @@ -1517,11 +1709,15 @@ static noinline_for_stack bool rcu_gp_init(void)
> >  	/* Record GP times before starting GP, hence rcu_seq_start(). */
> >  	rcu_seq_start(&rcu_state.gp_seq);
> >  	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > -	rcu_sr_normal_gp_init();
> > +	start_new_poll = rcu_sr_normal_gp_init();
> >  	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> >  	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> >  	raw_spin_unlock_irq_rcu_node(rnp);
> >  
> > +	// New poll request after rnp unlock
> > +	if (start_new_poll)
> > +		(void) start_poll_synchronize_rcu();
> 
> You lost me on this one.  Anything that got moved to the wait list
> should be handled by the current grace period, right?  Or is the
> problem that rcu_sr_normal_gp_init() is being invoked after the call
> to rcu_seq_start()?  If that is the case, could it be moved ahead so
> that we don't need the extra grace period?
> 
> Or am I missing something subtle here?
> 
The problem is that, we are limited in number of "wait-heads" which we
add as a marker node for this/current grace period. If there are more clients
and there is no a wait-head available it means that a system, the deferred
kworker, is slow in processing callbacks, thus all wait-nodes are in use.

That is why we need an extra grace period. Basically to repeat our try one
more time, i.e. it might be that a current grace period is not able to handle
users due to the fact that a system is doing really slow, but this is rather
a corner case and is not a problem.

--
Uladzislau Rezki
Paul E. McKenney Dec. 21, 2023, 6:40 p.m. UTC | #3
On Thu, Dec 21, 2023 at 11:52:33AM +0100, Uladzislau Rezki wrote:
> On Tue, Dec 19, 2023 at 05:37:56PM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 28, 2023 at 09:00:30AM +0100, Uladzislau Rezki (Sony) wrote:
> > > From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > > 
> > > Currently, processing of the next batch of rcu_synchronize nodes
> > > for the new grace period, requires doing a llist reversal operation
> > > to find the tail element of the list. This can be a very costly
> > > operation (high number of cache misses) for a long list.
> > > 
> > > To address this, this patch introduces a "dummy-wait-node" entity.
> > > At every grace period init, a new wait node is added to the llist.
> > > This wait node is used as wait tail for this new grace period.
> > > 
> > > This allows lockless additions of new rcu_synchronize nodes in the
> > > rcu_sr_normal_add_req(), while the cleanup work executes and does
> > > the progress. The dummy nodes are removed on next round of cleanup
> > > work execution.
> > > 
> > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > 
> > This says that Uladzislau created the patch and that Neeraj
> > acted as maintainer.  I am guessing that you both worked on it,
> > in which case is should have the Co-developed-by tags as shown in
> > Documentation/process/submitting-patches.rst.  Could you please update
> > these to reflect the actual origin?
> > 
> Right. We both worked on it. Neeraj is an author whereas i should mark
> myself as a Co-developed-by. This is a correct way. Thank you for
> pointing on it!

Sounds good, thank you!

> > One question below toward the end.  There are probably others that I
> > should be asking, but I have to start somewhere.  ;-)
> > 
> Good :)
> 
> > >  
> > >  /*
> > >   * Helper function for rcu_gp_init().
> > >   */
> > > -static void rcu_sr_normal_gp_init(void)
> > > +static bool rcu_sr_normal_gp_init(void)
> > >  {
> > > -	struct llist_node *head, *tail;
> > > +	struct llist_node *first;
> > > +	struct llist_node *wait_head;
> > > +	bool start_new_poll = false;
> > >  
> > > -	if (llist_empty(&sr.srs_next))
> > > -		return;
> > > +	first = READ_ONCE(sr.srs_next.first);
> > > +	if (!first || rcu_sr_is_wait_head(first))
> > > +		return start_new_poll;
> > > +
> > > +	wait_head = rcu_sr_get_wait_head();
> > > +	if (!wait_head) {
> > > +		// Kick another GP to retry.
> > > +		start_new_poll = true;
> > > +		return start_new_poll;
> > > +	}
> > >  
> > > -	tail = llist_del_all(&sr.srs_next);
> > > -	head = llist_reverse_order(tail);
> > > +	/* Inject a wait-dummy-node. */
> > > +	llist_add(wait_head, &sr.srs_next);
> > >  
> > >  	/*
> > > -	 * A waiting list of GP should be empty on this step,
> > > -	 * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > > +	 * A waiting list of rcu_synchronize nodes should be empty on
> > > +	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > >  	 * rolls it over. If not, it is a BUG, warn a user.
> > >  	 */
> > > -	WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
> > > +	WARN_ON_ONCE(sr.srs_wait_tail != NULL);
> > > +	sr.srs_wait_tail = wait_head;
> > > +	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
> > >  
> > > -	WRITE_ONCE(sr.srs_wait_tail, tail);
> > > -	__llist_add_batch(head, tail, &sr.srs_wait);
> > > +	return start_new_poll;
> > >  }
> > >  
> > >  static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > > @@ -1493,6 +1684,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> > >  	unsigned long mask;
> > >  	struct rcu_data *rdp;
> > >  	struct rcu_node *rnp = rcu_get_root();
> > > +	bool start_new_poll;
> > >  
> > >  	WRITE_ONCE(rcu_state.gp_activity, jiffies);
> > >  	raw_spin_lock_irq_rcu_node(rnp);
> > > @@ -1517,11 +1709,15 @@ static noinline_for_stack bool rcu_gp_init(void)
> > >  	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > >  	rcu_seq_start(&rcu_state.gp_seq);
> > >  	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > -	rcu_sr_normal_gp_init();
> > > +	start_new_poll = rcu_sr_normal_gp_init();
> > >  	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> > >  	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> > >  	raw_spin_unlock_irq_rcu_node(rnp);
> > >  
> > > +	// New poll request after rnp unlock
> > > +	if (start_new_poll)
> > > +		(void) start_poll_synchronize_rcu();
> > 
> > You lost me on this one.  Anything that got moved to the wait list
> > should be handled by the current grace period, right?  Or is the
> > problem that rcu_sr_normal_gp_init() is being invoked after the call
> > to rcu_seq_start()?  If that is the case, could it be moved ahead so
> > that we don't need the extra grace period?
> > 
> > Or am I missing something subtle here?
> > 
> The problem is that, we are limited in number of "wait-heads" which we
> add as a marker node for this/current grace period. If there are more clients
> and there is no a wait-head available it means that a system, the deferred
> kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> 
> That is why we need an extra grace period. Basically to repeat our try one
> more time, i.e. it might be that a current grace period is not able to handle
> users due to the fact that a system is doing really slow, but this is rather
> a corner case and is not a problem.

But in that case, the real issue is not the need for an extra grace
period, but rather the need for the wakeup processing to happen, correct?
Or am I missing something subtle here?

							Thanx, Paul
Uladzislau Rezki Dec. 22, 2023, 9:27 a.m. UTC | #4
On Thu, Dec 21, 2023 at 10:40:21AM -0800, Paul E. McKenney wrote:
> On Thu, Dec 21, 2023 at 11:52:33AM +0100, Uladzislau Rezki wrote:
> > On Tue, Dec 19, 2023 at 05:37:56PM -0800, Paul E. McKenney wrote:
> > > On Tue, Nov 28, 2023 at 09:00:30AM +0100, Uladzislau Rezki (Sony) wrote:
> > > > From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > > > 
> > > > Currently, processing of the next batch of rcu_synchronize nodes
> > > > for the new grace period, requires doing a llist reversal operation
> > > > to find the tail element of the list. This can be a very costly
> > > > operation (high number of cache misses) for a long list.
> > > > 
> > > > To address this, this patch introduces a "dummy-wait-node" entity.
> > > > At every grace period init, a new wait node is added to the llist.
> > > > This wait node is used as wait tail for this new grace period.
> > > > 
> > > > This allows lockless additions of new rcu_synchronize nodes in the
> > > > rcu_sr_normal_add_req(), while the cleanup work executes and does
> > > > the progress. The dummy nodes are removed on next round of cleanup
> > > > work execution.
> > > > 
> > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > > 
> > > This says that Uladzislau created the patch and that Neeraj
> > > acted as maintainer.  I am guessing that you both worked on it,
> > > in which case is should have the Co-developed-by tags as shown in
> > > Documentation/process/submitting-patches.rst.  Could you please update
> > > these to reflect the actual origin?
> > > 
> > Right. We both worked on it. Neeraj is an author whereas i should mark
> > myself as a Co-developed-by. This is a correct way. Thank you for
> > pointing on it!
> 
> Sounds good, thank you!
> 
> > > One question below toward the end.  There are probably others that I
> > > should be asking, but I have to start somewhere.  ;-)
> > > 
> > Good :)
> > 
> > > >  
> > > >  /*
> > > >   * Helper function for rcu_gp_init().
> > > >   */
> > > > -static void rcu_sr_normal_gp_init(void)
> > > > +static bool rcu_sr_normal_gp_init(void)
> > > >  {
> > > > -	struct llist_node *head, *tail;
> > > > +	struct llist_node *first;
> > > > +	struct llist_node *wait_head;
> > > > +	bool start_new_poll = false;
> > > >  
> > > > -	if (llist_empty(&sr.srs_next))
> > > > -		return;
> > > > +	first = READ_ONCE(sr.srs_next.first);
> > > > +	if (!first || rcu_sr_is_wait_head(first))
> > > > +		return start_new_poll;
> > > > +
> > > > +	wait_head = rcu_sr_get_wait_head();
> > > > +	if (!wait_head) {
> > > > +		// Kick another GP to retry.
> > > > +		start_new_poll = true;
> > > > +		return start_new_poll;
> > > > +	}
> > > >  
> > > > -	tail = llist_del_all(&sr.srs_next);
> > > > -	head = llist_reverse_order(tail);
> > > > +	/* Inject a wait-dummy-node. */
> > > > +	llist_add(wait_head, &sr.srs_next);
> > > >  
> > > >  	/*
> > > > -	 * A waiting list of GP should be empty on this step,
> > > > -	 * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > > > +	 * A waiting list of rcu_synchronize nodes should be empty on
> > > > +	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > > >  	 * rolls it over. If not, it is a BUG, warn a user.
> > > >  	 */
> > > > -	WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
> > > > +	WARN_ON_ONCE(sr.srs_wait_tail != NULL);
> > > > +	sr.srs_wait_tail = wait_head;
> > > > +	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
> > > >  
> > > > -	WRITE_ONCE(sr.srs_wait_tail, tail);
> > > > -	__llist_add_batch(head, tail, &sr.srs_wait);
> > > > +	return start_new_poll;
> > > >  }
> > > >  
> > > >  static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > > > @@ -1493,6 +1684,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> > > >  	unsigned long mask;
> > > >  	struct rcu_data *rdp;
> > > >  	struct rcu_node *rnp = rcu_get_root();
> > > > +	bool start_new_poll;
> > > >  
> > > >  	WRITE_ONCE(rcu_state.gp_activity, jiffies);
> > > >  	raw_spin_lock_irq_rcu_node(rnp);
> > > > @@ -1517,11 +1709,15 @@ static noinline_for_stack bool rcu_gp_init(void)
> > > >  	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > >  	rcu_seq_start(&rcu_state.gp_seq);
> > > >  	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > -	rcu_sr_normal_gp_init();
> > > > +	start_new_poll = rcu_sr_normal_gp_init();
> > > >  	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> > > >  	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> > > >  	raw_spin_unlock_irq_rcu_node(rnp);
> > > >  
> > > > +	// New poll request after rnp unlock
> > > > +	if (start_new_poll)
> > > > +		(void) start_poll_synchronize_rcu();
> > > 
> > > You lost me on this one.  Anything that got moved to the wait list
> > > should be handled by the current grace period, right?  Or is the
> > > problem that rcu_sr_normal_gp_init() is being invoked after the call
> > > to rcu_seq_start()?  If that is the case, could it be moved ahead so
> > > that we don't need the extra grace period?
> > > 
> > > Or am I missing something subtle here?
> > > 
> > The problem is that, we are limited in number of "wait-heads" which we
> > add as a marker node for this/current grace period. If there are more clients
> > and there is no a wait-head available it means that a system, the deferred
> > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > 
> > That is why we need an extra grace period. Basically to repeat our try one
> > more time, i.e. it might be that a current grace period is not able to handle
> > users due to the fact that a system is doing really slow, but this is rather
> > a corner case and is not a problem.
> 
> But in that case, the real issue is not the need for an extra grace
> period, but rather the need for the wakeup processing to happen, correct?
> Or am I missing something subtle here?
> 
Basically, yes. If we had a spare dummy-node we could process the users
by the current GP(no need in extra). Why we may not have it - it is because
like you pointed:

- wake-up issue, i.e. wake-up time + when we are on_cpu;
- slow list process. For example priority. The kworker is not
  given enough CPU time to do the progress, thus "dummy-nodes"
  are not released in time for reuse.

Therefore, en extra GP is requested if there is a high flow of
synchronize_rcu() users and kworker is not able to do a progress
in time.

For example 60K+ parallel synchronize_rcu() users will trigger it.

--
Uladzislau Rezki
Paul E. McKenney Dec. 22, 2023, 6:58 p.m. UTC | #5
On Fri, Dec 22, 2023 at 10:27:41AM +0100, Uladzislau Rezki wrote:
> On Thu, Dec 21, 2023 at 10:40:21AM -0800, Paul E. McKenney wrote:
> > On Thu, Dec 21, 2023 at 11:52:33AM +0100, Uladzislau Rezki wrote:
> > > On Tue, Dec 19, 2023 at 05:37:56PM -0800, Paul E. McKenney wrote:
> > > > On Tue, Nov 28, 2023 at 09:00:30AM +0100, Uladzislau Rezki (Sony) wrote:
> > > > > From: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > > > > 
> > > > > Currently, processing of the next batch of rcu_synchronize nodes
> > > > > for the new grace period, requires doing a llist reversal operation
> > > > > to find the tail element of the list. This can be a very costly
> > > > > operation (high number of cache misses) for a long list.
> > > > > 
> > > > > To address this, this patch introduces a "dummy-wait-node" entity.
> > > > > At every grace period init, a new wait node is added to the llist.
> > > > > This wait node is used as wait tail for this new grace period.
> > > > > 
> > > > > This allows lockless additions of new rcu_synchronize nodes in the
> > > > > rcu_sr_normal_add_req(), while the cleanup work executes and does
> > > > > the progress. The dummy nodes are removed on next round of cleanup
> > > > > work execution.
> > > > > 
> > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> > > > 
> > > > This says that Uladzislau created the patch and that Neeraj
> > > > acted as maintainer.  I am guessing that you both worked on it,
> > > > in which case is should have the Co-developed-by tags as shown in
> > > > Documentation/process/submitting-patches.rst.  Could you please update
> > > > these to reflect the actual origin?
> > > > 
> > > Right. We both worked on it. Neeraj is an author whereas i should mark
> > > myself as a Co-developed-by. This is a correct way. Thank you for
> > > pointing on it!
> > 
> > Sounds good, thank you!
> > 
> > > > One question below toward the end.  There are probably others that I
> > > > should be asking, but I have to start somewhere.  ;-)
> > > > 
> > > Good :)
> > > 
> > > > >  
> > > > >  /*
> > > > >   * Helper function for rcu_gp_init().
> > > > >   */
> > > > > -static void rcu_sr_normal_gp_init(void)
> > > > > +static bool rcu_sr_normal_gp_init(void)
> > > > >  {
> > > > > -	struct llist_node *head, *tail;
> > > > > +	struct llist_node *first;
> > > > > +	struct llist_node *wait_head;
> > > > > +	bool start_new_poll = false;
> > > > >  
> > > > > -	if (llist_empty(&sr.srs_next))
> > > > > -		return;
> > > > > +	first = READ_ONCE(sr.srs_next.first);
> > > > > +	if (!first || rcu_sr_is_wait_head(first))
> > > > > +		return start_new_poll;
> > > > > +
> > > > > +	wait_head = rcu_sr_get_wait_head();
> > > > > +	if (!wait_head) {
> > > > > +		// Kick another GP to retry.
> > > > > +		start_new_poll = true;
> > > > > +		return start_new_poll;
> > > > > +	}
> > > > >  
> > > > > -	tail = llist_del_all(&sr.srs_next);
> > > > > -	head = llist_reverse_order(tail);
> > > > > +	/* Inject a wait-dummy-node. */
> > > > > +	llist_add(wait_head, &sr.srs_next);
> > > > >  
> > > > >  	/*
> > > > > -	 * A waiting list of GP should be empty on this step,
> > > > > -	 * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > > > > +	 * A waiting list of rcu_synchronize nodes should be empty on
> > > > > +	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
> > > > >  	 * rolls it over. If not, it is a BUG, warn a user.
> > > > >  	 */
> > > > > -	WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
> > > > > +	WARN_ON_ONCE(sr.srs_wait_tail != NULL);
> > > > > +	sr.srs_wait_tail = wait_head;
> > > > > +	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
> > > > >  
> > > > > -	WRITE_ONCE(sr.srs_wait_tail, tail);
> > > > > -	__llist_add_batch(head, tail, &sr.srs_wait);
> > > > > +	return start_new_poll;
> > > > >  }
> > > > >  
> > > > >  static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> > > > > @@ -1493,6 +1684,7 @@ static noinline_for_stack bool rcu_gp_init(void)
> > > > >  	unsigned long mask;
> > > > >  	struct rcu_data *rdp;
> > > > >  	struct rcu_node *rnp = rcu_get_root();
> > > > > +	bool start_new_poll;
> > > > >  
> > > > >  	WRITE_ONCE(rcu_state.gp_activity, jiffies);
> > > > >  	raw_spin_lock_irq_rcu_node(rnp);
> > > > > @@ -1517,11 +1709,15 @@ static noinline_for_stack bool rcu_gp_init(void)
> > > > >  	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > >  	rcu_seq_start(&rcu_state.gp_seq);
> > > > >  	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > > -	rcu_sr_normal_gp_init();
> > > > > +	start_new_poll = rcu_sr_normal_gp_init();
> > > > >  	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> > > > >  	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> > > > >  	raw_spin_unlock_irq_rcu_node(rnp);
> > > > >  
> > > > > +	// New poll request after rnp unlock
> > > > > +	if (start_new_poll)
> > > > > +		(void) start_poll_synchronize_rcu();
> > > > 
> > > > You lost me on this one.  Anything that got moved to the wait list
> > > > should be handled by the current grace period, right?  Or is the
> > > > problem that rcu_sr_normal_gp_init() is being invoked after the call
> > > > to rcu_seq_start()?  If that is the case, could it be moved ahead so
> > > > that we don't need the extra grace period?
> > > > 
> > > > Or am I missing something subtle here?
> > > > 
> > > The problem is that, we are limited in number of "wait-heads" which we
> > > add as a marker node for this/current grace period. If there are more clients
> > > and there is no a wait-head available it means that a system, the deferred
> > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > 
> > > That is why we need an extra grace period. Basically to repeat our try one
> > > more time, i.e. it might be that a current grace period is not able to handle
> > > users due to the fact that a system is doing really slow, but this is rather
> > > a corner case and is not a problem.
> > 
> > But in that case, the real issue is not the need for an extra grace
> > period, but rather the need for the wakeup processing to happen, correct?
> > Or am I missing something subtle here?
> > 
> Basically, yes. If we had a spare dummy-node we could process the users
> by the current GP(no need in extra). Why we may not have it - it is because
> like you pointed:
> 
> - wake-up issue, i.e. wake-up time + when we are on_cpu;
> - slow list process. For example priority. The kworker is not
>   given enough CPU time to do the progress, thus "dummy-nodes"
>   are not released in time for reuse.
> 
> Therefore, en extra GP is requested if there is a high flow of
> synchronize_rcu() users and kworker is not able to do a progress
> in time.
> 
> For example 60K+ parallel synchronize_rcu() users will trigger it.

OK, but what bad thing would happen if that was moved to precede the
rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
would be the same as the one that is just now starting.

Something like this?

	start_new_poll = rcu_sr_normal_gp_init();

	/* Record GP times before starting GP, hence rcu_seq_start(). */
	rcu_seq_start(&rcu_state.gp_seq);
	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);

	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
	raw_spin_unlock_irq_rcu_node(rnp);

	// New poll request after rnp unlock
	if (start_new_poll)
		(void) start_poll_synchronize_rcu();

Yes, rcu_sr_normal_gp_init() might need some adjustment given that it
is seeing the pre-GP value of rcu_state.gp_seq.

But unless I am missing something, what you have now can result in
extra grace periods, which incur overhead on what would otherwise be an
idle system.

							Thanx, Paul
Uladzislau Rezki Jan. 2, 2024, 12:52 p.m. UTC | #6
Hello, Paul!

Sorry for late answer, it is because of holidays :)

> > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > add as a marker node for this/current grace period. If there are more clients
> > > > and there is no a wait-head available it means that a system, the deferred
> > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > 
> > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > users due to the fact that a system is doing really slow, but this is rather
> > > > a corner case and is not a problem.
> > > 
> > > But in that case, the real issue is not the need for an extra grace
> > > period, but rather the need for the wakeup processing to happen, correct?
> > > Or am I missing something subtle here?
> > > 
> > Basically, yes. If we had a spare dummy-node we could process the users
> > by the current GP(no need in extra). Why we may not have it - it is because
> > like you pointed:
> > 
> > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > - slow list process. For example priority. The kworker is not
> >   given enough CPU time to do the progress, thus "dummy-nodes"
> >   are not released in time for reuse.
> > 
> > Therefore, en extra GP is requested if there is a high flow of
> > synchronize_rcu() users and kworker is not able to do a progress
> > in time.
> > 
> > For example 60K+ parallel synchronize_rcu() users will trigger it.
> 
> OK, but what bad thing would happen if that was moved to precede the
> rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> would be the same as the one that is just now starting.
> 
> Something like this?
> 
> 	start_new_poll = rcu_sr_normal_gp_init();
> 
> 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> 	rcu_seq_start(&rcu_state.gp_seq);
> 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
>
I had a concern about the case when rcu_sr_normal_gp_init() handles what
we currently have, in terms of requests. Right after that there is/are
extra sync requests which invoke the start_poll_synchronize_rcu() but
since a GP has been requested before it will not request an extra one. So
"last" incoming users might not be processed.

That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
updated.

I can miss something, so please comment. Apart of that we can move it
as you proposed.

--
Uladzislau Rezki
Paul E. McKenney Jan. 2, 2024, 7:25 p.m. UTC | #7
On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> Hello, Paul!
> 
> Sorry for late answer, it is because of holidays :)
> 
> > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > 
> > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > a corner case and is not a problem.
> > > > 
> > > > But in that case, the real issue is not the need for an extra grace
> > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > Or am I missing something subtle here?
> > > > 
> > > Basically, yes. If we had a spare dummy-node we could process the users
> > > by the current GP(no need in extra). Why we may not have it - it is because
> > > like you pointed:
> > > 
> > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > - slow list process. For example priority. The kworker is not
> > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > >   are not released in time for reuse.
> > > 
> > > Therefore, en extra GP is requested if there is a high flow of
> > > synchronize_rcu() users and kworker is not able to do a progress
> > > in time.
> > > 
> > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > 
> > OK, but what bad thing would happen if that was moved to precede the
> > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > would be the same as the one that is just now starting.
> > 
> > Something like this?
> > 
> > 	start_new_poll = rcu_sr_normal_gp_init();
> > 
> > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > 	rcu_seq_start(&rcu_state.gp_seq);
> > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> >
> I had a concern about the case when rcu_sr_normal_gp_init() handles what
> we currently have, in terms of requests. Right after that there is/are
> extra sync requests which invoke the start_poll_synchronize_rcu() but
> since a GP has been requested before it will not request an extra one. So
> "last" incoming users might not be processed.
> 
> That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> updated.
> 
> I can miss something, so please comment. Apart of that we can move it
> as you proposed.

Couldn't that possibility be handled by a check in rcu_gp_cleanup()?

							Thanx, Paul
Uladzislau Rezki Jan. 3, 2024, 1:16 p.m. UTC | #8
On Tue, Jan 02, 2024 at 11:25:13AM -0800, Paul E. McKenney wrote:
> On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> > Hello, Paul!
> > 
> > Sorry for late answer, it is because of holidays :)
> > 
> > > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > > 
> > > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > > a corner case and is not a problem.
> > > > > 
> > > > > But in that case, the real issue is not the need for an extra grace
> > > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > > Or am I missing something subtle here?
> > > > > 
> > > > Basically, yes. If we had a spare dummy-node we could process the users
> > > > by the current GP(no need in extra). Why we may not have it - it is because
> > > > like you pointed:
> > > > 
> > > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > > - slow list process. For example priority. The kworker is not
> > > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > > >   are not released in time for reuse.
> > > > 
> > > > Therefore, en extra GP is requested if there is a high flow of
> > > > synchronize_rcu() users and kworker is not able to do a progress
> > > > in time.
> > > > 
> > > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > > 
> > > OK, but what bad thing would happen if that was moved to precede the
> > > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > > would be the same as the one that is just now starting.
> > > 
> > > Something like this?
> > > 
> > > 	start_new_poll = rcu_sr_normal_gp_init();
> > > 
> > > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > 	rcu_seq_start(&rcu_state.gp_seq);
> > > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > >
> > I had a concern about the case when rcu_sr_normal_gp_init() handles what
> > we currently have, in terms of requests. Right after that there is/are
> > extra sync requests which invoke the start_poll_synchronize_rcu() but
> > since a GP has been requested before it will not request an extra one. So
> > "last" incoming users might not be processed.
> > 
> > That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> > updated.
> > 
> > I can miss something, so please comment. Apart of that we can move it
> > as you proposed.
> 
> Couldn't that possibility be handled by a check in rcu_gp_cleanup()?
> 
It is controlled by the caller anyway, i.e. if a new GP is needed.

I am not 100% sure it is as straightforward as it could look like to
handle it in the rcu_sr_normal_gp_cleaup() function. At least i see
that we need to access to the first element of llist and find out if
it is a wait-dummy-head or not. If not we know there are extra incoming
calls.

So that way requires extra calling of start_poll_synchronize_rcu().

I can add a comment about your concern and we can find the best approach
later, if it is OK with you!

--
Uladzislau Rezki
Paul E. McKenney Jan. 3, 2024, 2:47 p.m. UTC | #9
On Wed, Jan 03, 2024 at 02:16:00PM +0100, Uladzislau Rezki wrote:
> On Tue, Jan 02, 2024 at 11:25:13AM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> > > Hello, Paul!
> > > 
> > > Sorry for late answer, it is because of holidays :)
> > > 
> > > > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > > > 
> > > > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > > > a corner case and is not a problem.
> > > > > > 
> > > > > > But in that case, the real issue is not the need for an extra grace
> > > > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > > > Or am I missing something subtle here?
> > > > > > 
> > > > > Basically, yes. If we had a spare dummy-node we could process the users
> > > > > by the current GP(no need in extra). Why we may not have it - it is because
> > > > > like you pointed:
> > > > > 
> > > > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > > > - slow list process. For example priority. The kworker is not
> > > > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > > > >   are not released in time for reuse.
> > > > > 
> > > > > Therefore, en extra GP is requested if there is a high flow of
> > > > > synchronize_rcu() users and kworker is not able to do a progress
> > > > > in time.
> > > > > 
> > > > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > > > 
> > > > OK, but what bad thing would happen if that was moved to precede the
> > > > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > > > would be the same as the one that is just now starting.
> > > > 
> > > > Something like this?
> > > > 
> > > > 	start_new_poll = rcu_sr_normal_gp_init();
> > > > 
> > > > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > 	rcu_seq_start(&rcu_state.gp_seq);
> > > > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > >
> > > I had a concern about the case when rcu_sr_normal_gp_init() handles what
> > > we currently have, in terms of requests. Right after that there is/are
> > > extra sync requests which invoke the start_poll_synchronize_rcu() but
> > > since a GP has been requested before it will not request an extra one. So
> > > "last" incoming users might not be processed.
> > > 
> > > That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> > > updated.
> > > 
> > > I can miss something, so please comment. Apart of that we can move it
> > > as you proposed.
> > 
> > Couldn't that possibility be handled by a check in rcu_gp_cleanup()?
> > 
> It is controlled by the caller anyway, i.e. if a new GP is needed.
> 
> I am not 100% sure it is as straightforward as it could look like to
> handle it in the rcu_sr_normal_gp_cleaup() function. At least i see
> that we need to access to the first element of llist and find out if
> it is a wait-dummy-head or not. If not we know there are extra incoming
> calls.
> 
> So that way requires extra calling of start_poll_synchronize_rcu().

If this is invoked early enough in rcu_gp_cleanup(), all that needs to
happen is to set the need_gp flag.  Plus you can count the number of
requests, and snapshot that number at rcu_gp_init() time and check to
see if it changed at rcu_gp_cleanup() time.  Later on, this could be
used to reduce the number of wakeups, correct?

> I can add a comment about your concern and we can find the best approach
> later, if it is OK with you!

I agree that this should be added via a later patch, though I have not
yet given up on the possibility that this patch might be simple enough
to be later in this same series.

							Thanx, Paul
Uladzislau Rezki Jan. 3, 2024, 5:35 p.m. UTC | #10
On Wed, Jan 03, 2024 at 06:47:30AM -0800, Paul E. McKenney wrote:
> On Wed, Jan 03, 2024 at 02:16:00PM +0100, Uladzislau Rezki wrote:
> > On Tue, Jan 02, 2024 at 11:25:13AM -0800, Paul E. McKenney wrote:
> > > On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> > > > Hello, Paul!
> > > > 
> > > > Sorry for late answer, it is because of holidays :)
> > > > 
> > > > > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > > > > 
> > > > > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > > > > a corner case and is not a problem.
> > > > > > > 
> > > > > > > But in that case, the real issue is not the need for an extra grace
> > > > > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > > > > Or am I missing something subtle here?
> > > > > > > 
> > > > > > Basically, yes. If we had a spare dummy-node we could process the users
> > > > > > by the current GP(no need in extra). Why we may not have it - it is because
> > > > > > like you pointed:
> > > > > > 
> > > > > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > > > > - slow list process. For example priority. The kworker is not
> > > > > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > > > > >   are not released in time for reuse.
> > > > > > 
> > > > > > Therefore, en extra GP is requested if there is a high flow of
> > > > > > synchronize_rcu() users and kworker is not able to do a progress
> > > > > > in time.
> > > > > > 
> > > > > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > > > > 
> > > > > OK, but what bad thing would happen if that was moved to precede the
> > > > > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > > > > would be the same as the one that is just now starting.
> > > > > 
> > > > > Something like this?
> > > > > 
> > > > > 	start_new_poll = rcu_sr_normal_gp_init();
> > > > > 
> > > > > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > > 	rcu_seq_start(&rcu_state.gp_seq);
> > > > > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > >
> > > > I had a concern about the case when rcu_sr_normal_gp_init() handles what
> > > > we currently have, in terms of requests. Right after that there is/are
> > > > extra sync requests which invoke the start_poll_synchronize_rcu() but
> > > > since a GP has been requested before it will not request an extra one. So
> > > > "last" incoming users might not be processed.
> > > > 
> > > > That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> > > > updated.
> > > > 
> > > > I can miss something, so please comment. Apart of that we can move it
> > > > as you proposed.
> > > 
> > > Couldn't that possibility be handled by a check in rcu_gp_cleanup()?
> > > 
> > It is controlled by the caller anyway, i.e. if a new GP is needed.
> > 
> > I am not 100% sure it is as straightforward as it could look like to
> > handle it in the rcu_sr_normal_gp_cleaup() function. At least i see
> > that we need to access to the first element of llist and find out if
> > it is a wait-dummy-head or not. If not we know there are extra incoming
> > calls.
> > 
> > So that way requires extra calling of start_poll_synchronize_rcu().
> 
> If this is invoked early enough in rcu_gp_cleanup(), all that needs to
> happen is to set the need_gp flag.  Plus you can count the number of
> requests, and snapshot that number at rcu_gp_init() time and check to
> see if it changed at rcu_gp_cleanup() time.  Later on, this could be
> used to reduce the number of wakeups, correct?
> 
You mean instead of waking-up a gp-kthread just continue processing of
new users if they are exist? If so, i think, we can implement it as separate
patches.

> > I can add a comment about your concern and we can find the best approach
> > later, if it is OK with you!
> 
> I agree that this should be added via a later patch, though I have not
> yet given up on the possibility that this patch might be simple enough
> to be later in this same series.
> 
Maybe there is a small misunderstanding. Please note, the rcu_sr_normal_gp_init() 
function does not request any new gp, i.e. our approach does not do any extra GP
requests. It happens only if there are no any dummy-wait-head available as we
discussed it earlier.

--
Uladzislau Rezki
Paul E. McKenney Jan. 3, 2024, 5:56 p.m. UTC | #11
On Wed, Jan 03, 2024 at 06:35:20PM +0100, Uladzislau Rezki wrote:
> On Wed, Jan 03, 2024 at 06:47:30AM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 03, 2024 at 02:16:00PM +0100, Uladzislau Rezki wrote:
> > > On Tue, Jan 02, 2024 at 11:25:13AM -0800, Paul E. McKenney wrote:
> > > > On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> > > > > Hello, Paul!
> > > > > 
> > > > > Sorry for late answer, it is because of holidays :)
> > > > > 
> > > > > > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > > > > > 
> > > > > > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > > > > > a corner case and is not a problem.
> > > > > > > > 
> > > > > > > > But in that case, the real issue is not the need for an extra grace
> > > > > > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > > > > > Or am I missing something subtle here?
> > > > > > > > 
> > > > > > > Basically, yes. If we had a spare dummy-node we could process the users
> > > > > > > by the current GP(no need in extra). Why we may not have it - it is because
> > > > > > > like you pointed:
> > > > > > > 
> > > > > > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > > > > > - slow list process. For example priority. The kworker is not
> > > > > > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > > > > > >   are not released in time for reuse.
> > > > > > > 
> > > > > > > Therefore, en extra GP is requested if there is a high flow of
> > > > > > > synchronize_rcu() users and kworker is not able to do a progress
> > > > > > > in time.
> > > > > > > 
> > > > > > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > > > > > 
> > > > > > OK, but what bad thing would happen if that was moved to precede the
> > > > > > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > > > > > would be the same as the one that is just now starting.
> > > > > > 
> > > > > > Something like this?
> > > > > > 
> > > > > > 	start_new_poll = rcu_sr_normal_gp_init();
> > > > > > 
> > > > > > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > > > 	rcu_seq_start(&rcu_state.gp_seq);
> > > > > > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > > >
> > > > > I had a concern about the case when rcu_sr_normal_gp_init() handles what
> > > > > we currently have, in terms of requests. Right after that there is/are
> > > > > extra sync requests which invoke the start_poll_synchronize_rcu() but
> > > > > since a GP has been requested before it will not request an extra one. So
> > > > > "last" incoming users might not be processed.
> > > > > 
> > > > > That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> > > > > updated.
> > > > > 
> > > > > I can miss something, so please comment. Apart of that we can move it
> > > > > as you proposed.
> > > > 
> > > > Couldn't that possibility be handled by a check in rcu_gp_cleanup()?
> > > > 
> > > It is controlled by the caller anyway, i.e. if a new GP is needed.
> > > 
> > > I am not 100% sure it is as straightforward as it could look like to
> > > handle it in the rcu_sr_normal_gp_cleaup() function. At least i see
> > > that we need to access to the first element of llist and find out if
> > > it is a wait-dummy-head or not. If not we know there are extra incoming
> > > calls.
> > > 
> > > So that way requires extra calling of start_poll_synchronize_rcu().
> > 
> > If this is invoked early enough in rcu_gp_cleanup(), all that needs to
> > happen is to set the need_gp flag.  Plus you can count the number of
> > requests, and snapshot that number at rcu_gp_init() time and check to
> > see if it changed at rcu_gp_cleanup() time.  Later on, this could be
> > used to reduce the number of wakeups, correct?
> > 
> You mean instead of waking-up a gp-kthread just continue processing of
> new users if they are exist? If so, i think, we can implement it as separate
> patches.

Agreed, this is an optimization, and thus should be a separate patch.

> > > I can add a comment about your concern and we can find the best approach
> > > later, if it is OK with you!
> > 
> > I agree that this should be added via a later patch, though I have not
> > yet given up on the possibility that this patch might be simple enough
> > to be later in this same series.
> > 
> Maybe there is a small misunderstanding. Please note, the rcu_sr_normal_gp_init() 
> function does not request any new gp, i.e. our approach does not do any extra GP
> requests. It happens only if there are no any dummy-wait-head available as we
> discussed it earlier.

The start_poll_synchronize_rcu() added by your patch 4/7 will request
an additional grace period because it is invoked after rcu_seq_start()
is called, correct?  Or am I missing something subtle here?

							Thanx, Paul
Uladzislau Rezki Jan. 3, 2024, 7:02 p.m. UTC | #12
On Wed, Jan 03, 2024 at 09:56:42AM -0800, Paul E. McKenney wrote:
> On Wed, Jan 03, 2024 at 06:35:20PM +0100, Uladzislau Rezki wrote:
> > On Wed, Jan 03, 2024 at 06:47:30AM -0800, Paul E. McKenney wrote:
> > > On Wed, Jan 03, 2024 at 02:16:00PM +0100, Uladzislau Rezki wrote:
> > > > On Tue, Jan 02, 2024 at 11:25:13AM -0800, Paul E. McKenney wrote:
> > > > > On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> > > > > > Hello, Paul!
> > > > > > 
> > > > > > Sorry for late answer, it is because of holidays :)
> > > > > > 
> > > > > > > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > > > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > > > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > > > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > > > > > > 
> > > > > > > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > > > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > > > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > > > > > > a corner case and is not a problem.
> > > > > > > > > 
> > > > > > > > > But in that case, the real issue is not the need for an extra grace
> > > > > > > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > > > > > > Or am I missing something subtle here?
> > > > > > > > > 
> > > > > > > > Basically, yes. If we had a spare dummy-node we could process the users
> > > > > > > > by the current GP(no need in extra). Why we may not have it - it is because
> > > > > > > > like you pointed:
> > > > > > > > 
> > > > > > > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > > > > > > - slow list process. For example priority. The kworker is not
> > > > > > > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > > > > > > >   are not released in time for reuse.
> > > > > > > > 
> > > > > > > > Therefore, en extra GP is requested if there is a high flow of
> > > > > > > > synchronize_rcu() users and kworker is not able to do a progress
> > > > > > > > in time.
> > > > > > > > 
> > > > > > > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > > > > > > 
> > > > > > > OK, but what bad thing would happen if that was moved to precede the
> > > > > > > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > > > > > > would be the same as the one that is just now starting.
> > > > > > > 
> > > > > > > Something like this?
> > > > > > > 
> > > > > > > 	start_new_poll = rcu_sr_normal_gp_init();
> > > > > > > 
> > > > > > > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > > > > 	rcu_seq_start(&rcu_state.gp_seq);
> > > > > > > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > > > >
> > > > > > I had a concern about the case when rcu_sr_normal_gp_init() handles what
> > > > > > we currently have, in terms of requests. Right after that there is/are
> > > > > > extra sync requests which invoke the start_poll_synchronize_rcu() but
> > > > > > since a GP has been requested before it will not request an extra one. So
> > > > > > "last" incoming users might not be processed.
> > > > > > 
> > > > > > That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> > > > > > updated.
> > > > > > 
> > > > > > I can miss something, so please comment. Apart of that we can move it
> > > > > > as you proposed.
> > > > > 
> > > > > Couldn't that possibility be handled by a check in rcu_gp_cleanup()?
> > > > > 
> > > > It is controlled by the caller anyway, i.e. if a new GP is needed.
> > > > 
> > > > I am not 100% sure it is as straightforward as it could look like to
> > > > handle it in the rcu_sr_normal_gp_cleaup() function. At least i see
> > > > that we need to access to the first element of llist and find out if
> > > > it is a wait-dummy-head or not. If not we know there are extra incoming
> > > > calls.
> > > > 
> > > > So that way requires extra calling of start_poll_synchronize_rcu().
> > > 
> > > If this is invoked early enough in rcu_gp_cleanup(), all that needs to
> > > happen is to set the need_gp flag.  Plus you can count the number of
> > > requests, and snapshot that number at rcu_gp_init() time and check to
> > > see if it changed at rcu_gp_cleanup() time.  Later on, this could be
> > > used to reduce the number of wakeups, correct?
> > > 
> > You mean instead of waking-up a gp-kthread just continue processing of
> > new users if they are exist? If so, i think, we can implement it as separate
> > patches.
> 
> Agreed, this is an optimization, and thus should be a separate patch.
> 
> > > > I can add a comment about your concern and we can find the best approach
> > > > later, if it is OK with you!
> > > 
> > > I agree that this should be added via a later patch, though I have not
> > > yet given up on the possibility that this patch might be simple enough
> > > to be later in this same series.
> > > 
> > Maybe there is a small misunderstanding. Please note, the rcu_sr_normal_gp_init() 
> > function does not request any new gp, i.e. our approach does not do any extra GP
> > requests. It happens only if there are no any dummy-wait-head available as we
> > discussed it earlier.
> 
> The start_poll_synchronize_rcu() added by your patch 4/7 will request
> an additional grace period because it is invoked after rcu_seq_start()
> is called, correct?  Or am I missing something subtle here?
> 
<snip>
+       // New poll request after rnp unlock
+       if (start_new_poll)
+               (void) start_poll_synchronize_rcu();
+
<snip>

The "start_new_poll" is set to "true" only when _this_ GP is not able
to handle anything and there are outstanding users. It happens when the
rcu_sr_normal_gp_init() function was not able to insert a dummy separator
to the llist, because there were no left dummy-nodes(fixed number of them)
due to the fact that all of them are "in-use". The reason why there are no
dummy-nodes is because of slow progress because it is done by dedicated
kworker.

I can trigger it, i mean when we need an addition GP, start_new_pool is 1,
only when i run 20 000 processes concurrently in a tight loop:

<snip>
while (1)
  synchronize_rcu();
<snip>

in that scenario we start to ask for an addition GP because we are not up
to speed, i.e. a system is slow in processing callbacks and we need some
time until wait-node/nodes is/are released for reuse.

We need a next GP to move it forward, i.e. to repeat a try of attaching
a dummy-node.

--
Uladzislau Rezki
Uladzislau Rezki Jan. 3, 2024, 7:03 p.m. UTC | #13
On Wed, Jan 03, 2024 at 08:02:00PM +0100, Uladzislau Rezki wrote:
> On Wed, Jan 03, 2024 at 09:56:42AM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 03, 2024 at 06:35:20PM +0100, Uladzislau Rezki wrote:
> > > On Wed, Jan 03, 2024 at 06:47:30AM -0800, Paul E. McKenney wrote:
> > > > On Wed, Jan 03, 2024 at 02:16:00PM +0100, Uladzislau Rezki wrote:
> > > > > On Tue, Jan 02, 2024 at 11:25:13AM -0800, Paul E. McKenney wrote:
> > > > > > On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> > > > > > > Hello, Paul!
> > > > > > > 
> > > > > > > Sorry for late answer, it is because of holidays :)
> > > > > > > 
> > > > > > > > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > > > > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > > > > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > > > > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > > > > > > > 
> > > > > > > > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > > > > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > > > > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > > > > > > > a corner case and is not a problem.
> > > > > > > > > > 
> > > > > > > > > > But in that case, the real issue is not the need for an extra grace
> > > > > > > > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > > > > > > > Or am I missing something subtle here?
> > > > > > > > > > 
> > > > > > > > > Basically, yes. If we had a spare dummy-node we could process the users
> > > > > > > > > by the current GP(no need in extra). Why we may not have it - it is because
> > > > > > > > > like you pointed:
> > > > > > > > > 
> > > > > > > > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > > > > > > > - slow list process. For example priority. The kworker is not
> > > > > > > > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > > > > > > > >   are not released in time for reuse.
> > > > > > > > > 
> > > > > > > > > Therefore, en extra GP is requested if there is a high flow of
> > > > > > > > > synchronize_rcu() users and kworker is not able to do a progress
> > > > > > > > > in time.
> > > > > > > > > 
> > > > > > > > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > > > > > > > 
> > > > > > > > OK, but what bad thing would happen if that was moved to precede the
> > > > > > > > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > > > > > > > would be the same as the one that is just now starting.
> > > > > > > > 
> > > > > > > > Something like this?
> > > > > > > > 
> > > > > > > > 	start_new_poll = rcu_sr_normal_gp_init();
> > > > > > > > 
> > > > > > > > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > > > > > 	rcu_seq_start(&rcu_state.gp_seq);
> > > > > > > > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > > > > >
> > > > > > > I had a concern about the case when rcu_sr_normal_gp_init() handles what
> > > > > > > we currently have, in terms of requests. Right after that there is/are
> > > > > > > extra sync requests which invoke the start_poll_synchronize_rcu() but
> > > > > > > since a GP has been requested before it will not request an extra one. So
> > > > > > > "last" incoming users might not be processed.
> > > > > > > 
> > > > > > > That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> > > > > > > updated.
> > > > > > > 
> > > > > > > I can miss something, so please comment. Apart of that we can move it
> > > > > > > as you proposed.
> > > > > > 
> > > > > > Couldn't that possibility be handled by a check in rcu_gp_cleanup()?
> > > > > > 
> > > > > It is controlled by the caller anyway, i.e. if a new GP is needed.
> > > > > 
> > > > > I am not 100% sure it is as straightforward as it could look like to
> > > > > handle it in the rcu_sr_normal_gp_cleaup() function. At least i see
> > > > > that we need to access to the first element of llist and find out if
> > > > > it is a wait-dummy-head or not. If not we know there are extra incoming
> > > > > calls.
> > > > > 
> > > > > So that way requires extra calling of start_poll_synchronize_rcu().
> > > > 
> > > > If this is invoked early enough in rcu_gp_cleanup(), all that needs to
> > > > happen is to set the need_gp flag.  Plus you can count the number of
> > > > requests, and snapshot that number at rcu_gp_init() time and check to
> > > > see if it changed at rcu_gp_cleanup() time.  Later on, this could be
> > > > used to reduce the number of wakeups, correct?
> > > > 
> > > You mean instead of waking-up a gp-kthread just continue processing of
> > > new users if they are exist? If so, i think, we can implement it as separate
> > > patches.
> > 
> > Agreed, this is an optimization, and thus should be a separate patch.
> > 
> > > > > I can add a comment about your concern and we can find the best approach
> > > > > later, if it is OK with you!
> > > > 
> > > > I agree that this should be added via a later patch, though I have not
> > > > yet given up on the possibility that this patch might be simple enough
> > > > to be later in this same series.
> > > > 
> > > Maybe there is a small misunderstanding. Please note, the rcu_sr_normal_gp_init() 
> > > function does not request any new gp, i.e. our approach does not do any extra GP
> > > requests. It happens only if there are no any dummy-wait-head available as we
> > > discussed it earlier.
> > 
> > The start_poll_synchronize_rcu() added by your patch 4/7 will request
> > an additional grace period because it is invoked after rcu_seq_start()
> > is called, correct?  Or am I missing something subtle here?
> > 
> <snip>
> +       // New poll request after rnp unlock
> +       if (start_new_poll)
> +               (void) start_poll_synchronize_rcu();
> +
> <snip>
> 
> The "start_new_poll" is set to "true" only when _this_ GP is not able
> to handle anything and there are outstanding users. It happens when the
> rcu_sr_normal_gp_init() function was not able to insert a dummy separator
> to the llist, because there were no left dummy-nodes(fixed number of them)
> due to the fact that all of them are "in-use". The reason why there are no
> dummy-nodes is because of slow progress because it is done by dedicated
> kworker.
> 
> I can trigger it, i mean when we need an addition GP, start_new_pool is 1,
> only when i run 20 000 processes concurrently in a tight loop:
> 
> <snip>
> while (1)
>   synchronize_rcu();
> <snip>
> 
> in that scenario we start to ask for an addition GP because we are not up
> to speed, i.e. a system is slow in processing callbacks and we need some
> time until wait-node/nodes is/are released for reuse.
> 
> We need a next GP to move it forward, i.e. to repeat a try of attaching
> a dummy-node.
> 
Probably i should add a comment about it :)

--
Uladzislau Rezki
Paul E. McKenney Jan. 3, 2024, 7:33 p.m. UTC | #14
On Wed, Jan 03, 2024 at 08:03:10PM +0100, Uladzislau Rezki wrote:
> On Wed, Jan 03, 2024 at 08:02:00PM +0100, Uladzislau Rezki wrote:
> > On Wed, Jan 03, 2024 at 09:56:42AM -0800, Paul E. McKenney wrote:
> > > On Wed, Jan 03, 2024 at 06:35:20PM +0100, Uladzislau Rezki wrote:
> > > > On Wed, Jan 03, 2024 at 06:47:30AM -0800, Paul E. McKenney wrote:
> > > > > On Wed, Jan 03, 2024 at 02:16:00PM +0100, Uladzislau Rezki wrote:
> > > > > > On Tue, Jan 02, 2024 at 11:25:13AM -0800, Paul E. McKenney wrote:
> > > > > > > On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> > > > > > > > Hello, Paul!
> > > > > > > > 
> > > > > > > > Sorry for late answer, it is because of holidays :)
> > > > > > > > 
> > > > > > > > > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > > > > > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > > > > > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > > > > > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > > > > > > > > 
> > > > > > > > > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > > > > > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > > > > > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > > > > > > > > a corner case and is not a problem.
> > > > > > > > > > > 
> > > > > > > > > > > But in that case, the real issue is not the need for an extra grace
> > > > > > > > > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > > > > > > > > Or am I missing something subtle here?
> > > > > > > > > > > 
> > > > > > > > > > Basically, yes. If we had a spare dummy-node we could process the users
> > > > > > > > > > by the current GP(no need in extra). Why we may not have it - it is because
> > > > > > > > > > like you pointed:
> > > > > > > > > > 
> > > > > > > > > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > > > > > > > > - slow list process. For example priority. The kworker is not
> > > > > > > > > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > > > > > > > > >   are not released in time for reuse.
> > > > > > > > > > 
> > > > > > > > > > Therefore, en extra GP is requested if there is a high flow of
> > > > > > > > > > synchronize_rcu() users and kworker is not able to do a progress
> > > > > > > > > > in time.
> > > > > > > > > > 
> > > > > > > > > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > > > > > > > > 
> > > > > > > > > OK, but what bad thing would happen if that was moved to precede the
> > > > > > > > > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > > > > > > > > would be the same as the one that is just now starting.
> > > > > > > > > 
> > > > > > > > > Something like this?
> > > > > > > > > 
> > > > > > > > > 	start_new_poll = rcu_sr_normal_gp_init();
> > > > > > > > > 
> > > > > > > > > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > > > > > > 	rcu_seq_start(&rcu_state.gp_seq);
> > > > > > > > > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > > > > > >
> > > > > > > > I had a concern about the case when rcu_sr_normal_gp_init() handles what
> > > > > > > > we currently have, in terms of requests. Right after that there is/are
> > > > > > > > extra sync requests which invoke the start_poll_synchronize_rcu() but
> > > > > > > > since a GP has been requested before it will not request an extra one. So
> > > > > > > > "last" incoming users might not be processed.
> > > > > > > > 
> > > > > > > > That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> > > > > > > > updated.
> > > > > > > > 
> > > > > > > > I can miss something, so please comment. Apart of that we can move it
> > > > > > > > as you proposed.
> > > > > > > 
> > > > > > > Couldn't that possibility be handled by a check in rcu_gp_cleanup()?
> > > > > > > 
> > > > > > It is controlled by the caller anyway, i.e. if a new GP is needed.
> > > > > > 
> > > > > > I am not 100% sure it is as straightforward as it could look like to
> > > > > > handle it in the rcu_sr_normal_gp_cleaup() function. At least i see
> > > > > > that we need to access to the first element of llist and find out if
> > > > > > it is a wait-dummy-head or not. If not we know there are extra incoming
> > > > > > calls.
> > > > > > 
> > > > > > So that way requires extra calling of start_poll_synchronize_rcu().
> > > > > 
> > > > > If this is invoked early enough in rcu_gp_cleanup(), all that needs to
> > > > > happen is to set the need_gp flag.  Plus you can count the number of
> > > > > requests, and snapshot that number at rcu_gp_init() time and check to
> > > > > see if it changed at rcu_gp_cleanup() time.  Later on, this could be
> > > > > used to reduce the number of wakeups, correct?
> > > > > 
> > > > You mean instead of waking-up a gp-kthread just continue processing of
> > > > new users if they are exist? If so, i think, we can implement it as separate
> > > > patches.
> > > 
> > > Agreed, this is an optimization, and thus should be a separate patch.
> > > 
> > > > > > I can add a comment about your concern and we can find the best approach
> > > > > > later, if it is OK with you!
> > > > > 
> > > > > I agree that this should be added via a later patch, though I have not
> > > > > yet given up on the possibility that this patch might be simple enough
> > > > > to be later in this same series.
> > > > > 
> > > > Maybe there is a small misunderstanding. Please note, the rcu_sr_normal_gp_init() 
> > > > function does not request any new gp, i.e. our approach does not do any extra GP
> > > > requests. It happens only if there are no any dummy-wait-head available as we
> > > > discussed it earlier.
> > > 
> > > The start_poll_synchronize_rcu() added by your patch 4/7 will request
> > > an additional grace period because it is invoked after rcu_seq_start()
> > > is called, correct?  Or am I missing something subtle here?
> > > 
> > <snip>
> > +       // New poll request after rnp unlock
> > +       if (start_new_poll)
> > +               (void) start_poll_synchronize_rcu();
> > +
> > <snip>
> > 
> > The "start_new_poll" is set to "true" only when _this_ GP is not able
> > to handle anything and there are outstanding users. It happens when the
> > rcu_sr_normal_gp_init() function was not able to insert a dummy separator
> > to the llist, because there were no left dummy-nodes(fixed number of them)
> > due to the fact that all of them are "in-use". The reason why there are no
> > dummy-nodes is because of slow progress because it is done by dedicated
> > kworker.
> > 
> > I can trigger it, i mean when we need an addition GP, start_new_pool is 1,
> > only when i run 20 000 processes concurrently in a tight loop:
> > 
> > <snip>
> > while (1)
> >   synchronize_rcu();
> > <snip>
> > 
> > in that scenario we start to ask for an addition GP because we are not up
> > to speed, i.e. a system is slow in processing callbacks and we need some
> > time until wait-node/nodes is/are released for reuse.
> > 
> > We need a next GP to move it forward, i.e. to repeat a try of attaching
> > a dummy-node.
> > 
> Probably i should add a comment about it :)

Sounds good, and thank you for bearing with me!

							Thanx, Paul
Uladzislau Rezki Jan. 4, 2024, 11:17 a.m. UTC | #15
On Wed, Jan 03, 2024 at 11:33:01AM -0800, Paul E. McKenney wrote:
> On Wed, Jan 03, 2024 at 08:03:10PM +0100, Uladzislau Rezki wrote:
> > On Wed, Jan 03, 2024 at 08:02:00PM +0100, Uladzislau Rezki wrote:
> > > On Wed, Jan 03, 2024 at 09:56:42AM -0800, Paul E. McKenney wrote:
> > > > On Wed, Jan 03, 2024 at 06:35:20PM +0100, Uladzislau Rezki wrote:
> > > > > On Wed, Jan 03, 2024 at 06:47:30AM -0800, Paul E. McKenney wrote:
> > > > > > On Wed, Jan 03, 2024 at 02:16:00PM +0100, Uladzislau Rezki wrote:
> > > > > > > On Tue, Jan 02, 2024 at 11:25:13AM -0800, Paul E. McKenney wrote:
> > > > > > > > On Tue, Jan 02, 2024 at 01:52:26PM +0100, Uladzislau Rezki wrote:
> > > > > > > > > Hello, Paul!
> > > > > > > > > 
> > > > > > > > > Sorry for late answer, it is because of holidays :)
> > > > > > > > > 
> > > > > > > > > > > > > The problem is that, we are limited in number of "wait-heads" which we
> > > > > > > > > > > > > add as a marker node for this/current grace period. If there are more clients
> > > > > > > > > > > > > and there is no a wait-head available it means that a system, the deferred
> > > > > > > > > > > > > kworker, is slow in processing callbacks, thus all wait-nodes are in use.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > That is why we need an extra grace period. Basically to repeat our try one
> > > > > > > > > > > > > more time, i.e. it might be that a current grace period is not able to handle
> > > > > > > > > > > > > users due to the fact that a system is doing really slow, but this is rather
> > > > > > > > > > > > > a corner case and is not a problem.
> > > > > > > > > > > > 
> > > > > > > > > > > > But in that case, the real issue is not the need for an extra grace
> > > > > > > > > > > > period, but rather the need for the wakeup processing to happen, correct?
> > > > > > > > > > > > Or am I missing something subtle here?
> > > > > > > > > > > > 
> > > > > > > > > > > Basically, yes. If we had a spare dummy-node we could process the users
> > > > > > > > > > > by the current GP(no need in extra). Why we may not have it - it is because
> > > > > > > > > > > like you pointed:
> > > > > > > > > > > 
> > > > > > > > > > > - wake-up issue, i.e. wake-up time + when we are on_cpu;
> > > > > > > > > > > - slow list process. For example priority. The kworker is not
> > > > > > > > > > >   given enough CPU time to do the progress, thus "dummy-nodes"
> > > > > > > > > > >   are not released in time for reuse.
> > > > > > > > > > > 
> > > > > > > > > > > Therefore, en extra GP is requested if there is a high flow of
> > > > > > > > > > > synchronize_rcu() users and kworker is not able to do a progress
> > > > > > > > > > > in time.
> > > > > > > > > > > 
> > > > > > > > > > > For example 60K+ parallel synchronize_rcu() users will trigger it.
> > > > > > > > > > 
> > > > > > > > > > OK, but what bad thing would happen if that was moved to precede the
> > > > > > > > > > rcu_seq_start(&rcu_state.gp_seq)?  That way, the requested grace period
> > > > > > > > > > would be the same as the one that is just now starting.
> > > > > > > > > > 
> > > > > > > > > > Something like this?
> > > > > > > > > > 
> > > > > > > > > > 	start_new_poll = rcu_sr_normal_gp_init();
> > > > > > > > > > 
> > > > > > > > > > 	/* Record GP times before starting GP, hence rcu_seq_start(). */
> > > > > > > > > > 	rcu_seq_start(&rcu_state.gp_seq);
> > > > > > > > > > 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > > > > > > > >
> > > > > > > > > I had a concern about the case when rcu_sr_normal_gp_init() handles what
> > > > > > > > > we currently have, in terms of requests. Right after that there is/are
> > > > > > > > > extra sync requests which invoke the start_poll_synchronize_rcu() but
> > > > > > > > > since a GP has been requested before it will not request an extra one. So
> > > > > > > > > "last" incoming users might not be processed.
> > > > > > > > > 
> > > > > > > > > That is why i have placed the rcu_sr_normal_gp_init() after a gp_seq is
> > > > > > > > > updated.
> > > > > > > > > 
> > > > > > > > > I can miss something, so please comment. Apart of that we can move it
> > > > > > > > > as you proposed.
> > > > > > > > 
> > > > > > > > Couldn't that possibility be handled by a check in rcu_gp_cleanup()?
> > > > > > > > 
> > > > > > > It is controlled by the caller anyway, i.e. if a new GP is needed.
> > > > > > > 
> > > > > > > I am not 100% sure it is as straightforward as it could look like to
> > > > > > > handle it in the rcu_sr_normal_gp_cleaup() function. At least i see
> > > > > > > that we need to access to the first element of llist and find out if
> > > > > > > it is a wait-dummy-head or not. If not we know there are extra incoming
> > > > > > > calls.
> > > > > > > 
> > > > > > > So that way requires extra calling of start_poll_synchronize_rcu().
> > > > > > 
> > > > > > If this is invoked early enough in rcu_gp_cleanup(), all that needs to
> > > > > > happen is to set the need_gp flag.  Plus you can count the number of
> > > > > > requests, and snapshot that number at rcu_gp_init() time and check to
> > > > > > see if it changed at rcu_gp_cleanup() time.  Later on, this could be
> > > > > > used to reduce the number of wakeups, correct?
> > > > > > 
> > > > > You mean instead of waking-up a gp-kthread just continue processing of
> > > > > new users if they are exist? If so, i think, we can implement it as separate
> > > > > patches.
> > > > 
> > > > Agreed, this is an optimization, and thus should be a separate patch.
> > > > 
> > > > > > > I can add a comment about your concern and we can find the best approach
> > > > > > > later, if it is OK with you!
> > > > > > 
> > > > > > I agree that this should be added via a later patch, though I have not
> > > > > > yet given up on the possibility that this patch might be simple enough
> > > > > > to be later in this same series.
> > > > > > 
> > > > > Maybe there is a small misunderstanding. Please note, the rcu_sr_normal_gp_init() 
> > > > > function does not request any new gp, i.e. our approach does not do any extra GP
> > > > > requests. It happens only if there are no any dummy-wait-head available as we
> > > > > discussed it earlier.
> > > > 
> > > > The start_poll_synchronize_rcu() added by your patch 4/7 will request
> > > > an additional grace period because it is invoked after rcu_seq_start()
> > > > is called, correct?  Or am I missing something subtle here?
> > > > 
> > > <snip>
> > > +       // New poll request after rnp unlock
> > > +       if (start_new_poll)
> > > +               (void) start_poll_synchronize_rcu();
> > > +
> > > <snip>
> > > 
> > > The "start_new_poll" is set to "true" only when _this_ GP is not able
> > > to handle anything and there are outstanding users. It happens when the
> > > rcu_sr_normal_gp_init() function was not able to insert a dummy separator
> > > to the llist, because there were no left dummy-nodes(fixed number of them)
> > > due to the fact that all of them are "in-use". The reason why there are no
> > > dummy-nodes is because of slow progress because it is done by dedicated
> > > kworker.
> > > 
> > > I can trigger it, i mean when we need an addition GP, start_new_pool is 1,
> > > only when i run 20 000 processes concurrently in a tight loop:
> > > 
> > > <snip>
> > > while (1)
> > >   synchronize_rcu();
> > > <snip>
> > > 
> > > in that scenario we start to ask for an addition GP because we are not up
> > > to speed, i.e. a system is slow in processing callbacks and we need some
> > > time until wait-node/nodes is/are released for reuse.
> > > 
> > > We need a next GP to move it forward, i.e. to repeat a try of attaching
> > > a dummy-node.
> > > 
> > Probably i should add a comment about it :)
> 
> Sounds good, and thank you for bearing with me!
> 
Thanks to you :)

--
Uladzislau Rezki
diff mbox series

Patch

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 975621ef40e3..d7b48996825f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1384,25 +1384,173 @@  static void rcu_poll_gp_seq_end_unlocked(unsigned long *snap)
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
 
+#define SR_NORMAL_GP_WAIT_HEAD_MAX 5
+
+struct sr_wait_node {
+	atomic_t inuse;
+	struct llist_node node;
+};
+
 /*
- * There are three lists for handling synchronize_rcu() users.
- * A first list corresponds to new coming users, second for users
- * which wait for a grace period and third is for which a grace
- * period is passed.
+ * There is a single llist, which is used for handling
+ * synchronize_rcu() users' enqueued rcu_synchronize nodes.
+ * Within this llist, there are two tail pointers:
+ *
+ * wait tail: Tracks the set of nodes, which need to
+ *            wait for the current GP to complete.
+ * done tail: Tracks the set of nodes, for which grace
+ *            period has elapsed. These nodes processing
+ *            will be done as part of the cleanup work
+ *            execution by a kworker.
+ *
+ * At every grace period init, a new wait node is added
+ * to the llist. This wait node is used as wait tail
+ * for this new grace period. Given that there are a fixed
+ * number of wait nodes, if all wait nodes are in use
+ * (which can happen when kworker callback processing
+ * is delayed) and additional grace period is requested.
+ * This means, a system is slow in processing callbacks.
+ *
+ * TODO: If a slow processing is detected, a first node
+ * in the llist should be used as a wait-tail for this
+ * grace period, therefore users which should wait due
+ * to a slow process are handled by _this_ grace period
+ * and not next.
+ *
+ * Below is an illustration of how the done and wait
+ * tail pointers move from one set of rcu_synchronize nodes
+ * to the other, as grace periods start and finish and
+ * nodes are processed by kworker.
+ *
+ *
+ * a. Initial llist callbacks list:
+ *
+ * +----------+           +--------+          +-------+
+ * |          |           |        |          |       |
+ * |   head   |---------> |   cb2  |--------->| cb1   |
+ * |          |           |        |          |       |
+ * +----------+           +--------+          +-------+
+ *
+ *
+ *
+ * b. New GP1 Start:
+ *
+ *                    WAIT TAIL
+ *                      |
+ *                      |
+ *                      v
+ * +----------+     +--------+      +--------+        +-------+
+ * |          |     |        |      |        |        |       |
+ * |   head   ------> wait   |------>   cb2  |------> |  cb1  |
+ * |          |     | head1  |      |        |        |       |
+ * +----------+     +--------+      +--------+        +-------+
+ *
+ *
+ *
+ * c. GP completion:
+ *
+ * WAIT_TAIL == DONE_TAIL
+ *
+ *                   DONE TAIL
+ *                     |
+ *                     |
+ *                     v
+ * +----------+     +--------+      +--------+        +-------+
+ * |          |     |        |      |        |        |       |
+ * |   head   ------> wait   |------>   cb2  |------> |  cb1  |
+ * |          |     | head1  |      |        |        |       |
+ * +----------+     +--------+      +--------+        +-------+
+ *
+ *
+ *
+ * d. New callbacks and GP2 start:
+ *
+ *                    WAIT TAIL                          DONE TAIL
+ *                      |                                 |
+ *                      |                                 |
+ *                      v                                 v
+ * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
+ * |          |     |      |    |      |    |      |    |     |    |     |    |     |
+ * |   head   ------> wait |--->|  cb4 |--->| cb3  |--->|wait |--->| cb2 |--->| cb1 |
+ * |          |     | head2|    |      |    |      |    |head1|    |     |    |     |
+ * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
+ *
+ *
+ *
+ * e. GP2 completion:
+ *
+ * WAIT_TAIL == DONE_TAIL
+ *                   DONE TAIL
+ *                      |
+ *                      |
+ *                      v
+ * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
+ * |          |     |      |    |      |    |      |    |     |    |     |    |     |
+ * |   head   ------> wait |--->|  cb4 |--->| cb3  |--->|wait |--->| cb2 |--->| cb1 |
+ * |          |     | head2|    |      |    |      |    |head1|    |     |    |     |
+ * +----------+     +------+    +------+    +------+    +-----+    +-----+    +-----+
+ *
+ *
+ * While the llist state transitions from d to e, a kworker
+ * can start executing rcu_sr_normal_gp_cleanup_work() and
+ * can observe either the old done tail (@c) or the new
+ * done tail (@e). So, done tail updates and reads need
+ * to use the rel-acq semantics. If the concurrent kworker
+ * observes the old done tail, the newly queued work
+ * execution will process the updated done tail. If the
+ * concurrent kworker observes the new done tail, then
+ * the newly queued work will skip processing the done
+ * tail, as workqueue semantics guarantees that the new
+ * work is executed only after the previous one completes.
+ *
+ * f. kworker callbacks processing complete:
+ *
+ *
+ *                   DONE TAIL
+ *                     |
+ *                     |
+ *                     v
+ * +----------+     +--------+
+ * |          |     |        |
+ * |   head   ------> wait   |
+ * |          |     | head2  |
+ * +----------+     +--------+
+ *
  */
 static struct sr_normal_state {
 	struct llist_head srs_next;	/* request a GP users. */
-	struct llist_head srs_wait;	/* wait for GP users. */
-	struct llist_head srs_done;	/* ready for GP users. */
-
-	/*
-	 * In order to add a batch of nodes to already
-	 * existing srs-done-list, a tail of srs-wait-list
-	 * is maintained.
-	 */
-	struct llist_node *srs_wait_tail;
+	struct llist_node *srs_wait_tail; /* wait for GP users. */
+	struct llist_node *srs_done_tail; /* ready for GP users. */
+	struct sr_wait_node srs_wait_nodes[SR_NORMAL_GP_WAIT_HEAD_MAX];
 } sr;
 
+static bool rcu_sr_is_wait_head(struct llist_node *node)
+{
+	return &(sr.srs_wait_nodes)[0].node <= node &&
+		node <= &(sr.srs_wait_nodes)[SR_NORMAL_GP_WAIT_HEAD_MAX - 1].node;
+}
+
+static struct llist_node *rcu_sr_get_wait_head(void)
+{
+	struct sr_wait_node *sr_wn;
+	int i;
+
+	for (i = 0; i < SR_NORMAL_GP_WAIT_HEAD_MAX; i++) {
+		sr_wn = &(sr.srs_wait_nodes)[i];
+
+		if (!atomic_cmpxchg_acquire(&sr_wn->inuse, 0, 1))
+			return &sr_wn->node;
+	}
+
+	return NULL;
+}
+
+static void rcu_sr_put_wait_head(struct llist_node *node)
+{
+	struct sr_wait_node *sr_wn = container_of(node, struct sr_wait_node, node);
+	atomic_set_release(&sr_wn->inuse, 0);
+}
+
 /* Disabled by default. */
 static int rcu_normal_wake_from_gp;
 module_param(rcu_normal_wake_from_gp, int, 0644);
@@ -1423,14 +1571,44 @@  static void rcu_sr_normal_complete(struct llist_node *node)
 
 static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
 {
-	struct llist_node *done, *rcu, *next;
+	struct llist_node *done, *rcu, *next, *head;
 
-	done = llist_del_all(&sr.srs_done);
+	/*
+	 * This work execution can potentially execute
+	 * while a new done tail is being updated by
+	 * grace period kthread in rcu_sr_normal_gp_cleanup().
+	 * So, read and updates of done tail need to
+	 * follow acq-rel semantics.
+	 *
+	 * Given that wq semantics guarantees that a single work
+	 * cannot execute concurrently by multiple kworkers,
+	 * the done tail list manipulations are protected here.
+	 */
+	done = smp_load_acquire(&sr.srs_done_tail);
 	if (!done)
 		return;
 
-	llist_for_each_safe(rcu, next, done)
-		rcu_sr_normal_complete(rcu);
+	WARN_ON_ONCE(!rcu_sr_is_wait_head(done));
+	head = done->next;
+	done->next = NULL;
+
+	/*
+	 * The dummy node, which is pointed to by the
+	 * done tail which is acq-read above is not removed
+	 * here.  This allows lockless additions of new
+	 * rcu_synchronize nodes in rcu_sr_normal_add_req(),
+	 * while the cleanup work executes. The dummy
+	 * nodes is removed, in next round of cleanup
+	 * work execution.
+	 */
+	llist_for_each_safe(rcu, next, head) {
+		if (!rcu_sr_is_wait_head(rcu)) {
+			rcu_sr_normal_complete(rcu);
+			continue;
+		}
+
+		rcu_sr_put_wait_head(rcu);
+	}
 }
 static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
 
@@ -1439,43 +1617,56 @@  static DECLARE_WORK(sr_normal_gp_cleanup, rcu_sr_normal_gp_cleanup_work);
  */
 static void rcu_sr_normal_gp_cleanup(void)
 {
-	struct llist_node *head, *tail;
+	struct llist_node *wait_tail;
 
-	if (llist_empty(&sr.srs_wait))
+	wait_tail = sr.srs_wait_tail;
+	if (wait_tail == NULL)
 		return;
 
-	tail = READ_ONCE(sr.srs_wait_tail);
-	head = __llist_del_all(&sr.srs_wait);
+	sr.srs_wait_tail = NULL;
+	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
 
-	if (head) {
-		/* Can be not empty. */
-		llist_add_batch(head, tail, &sr.srs_done);
+	// concurrent sr_normal_gp_cleanup work might observe this update.
+	smp_store_release(&sr.srs_done_tail, wait_tail);
+	ASSERT_EXCLUSIVE_WRITER(sr.srs_done_tail);
+
+	if (wait_tail)
 		queue_work(system_highpri_wq, &sr_normal_gp_cleanup);
-	}
 }
 
 /*
  * Helper function for rcu_gp_init().
  */
-static void rcu_sr_normal_gp_init(void)
+static bool rcu_sr_normal_gp_init(void)
 {
-	struct llist_node *head, *tail;
+	struct llist_node *first;
+	struct llist_node *wait_head;
+	bool start_new_poll = false;
 
-	if (llist_empty(&sr.srs_next))
-		return;
+	first = READ_ONCE(sr.srs_next.first);
+	if (!first || rcu_sr_is_wait_head(first))
+		return start_new_poll;
+
+	wait_head = rcu_sr_get_wait_head();
+	if (!wait_head) {
+		// Kick another GP to retry.
+		start_new_poll = true;
+		return start_new_poll;
+	}
 
-	tail = llist_del_all(&sr.srs_next);
-	head = llist_reverse_order(tail);
+	/* Inject a wait-dummy-node. */
+	llist_add(wait_head, &sr.srs_next);
 
 	/*
-	 * A waiting list of GP should be empty on this step,
-	 * since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
+	 * A waiting list of rcu_synchronize nodes should be empty on
+	 * this step, since a GP-kthread, rcu_gp_init() -> gp_cleanup(),
 	 * rolls it over. If not, it is a BUG, warn a user.
 	 */
-	WARN_ON_ONCE(!llist_empty(&sr.srs_wait));
+	WARN_ON_ONCE(sr.srs_wait_tail != NULL);
+	sr.srs_wait_tail = wait_head;
+	ASSERT_EXCLUSIVE_WRITER(sr.srs_wait_tail);
 
-	WRITE_ONCE(sr.srs_wait_tail, tail);
-	__llist_add_batch(head, tail, &sr.srs_wait);
+	return start_new_poll;
 }
 
 static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
@@ -1493,6 +1684,7 @@  static noinline_for_stack bool rcu_gp_init(void)
 	unsigned long mask;
 	struct rcu_data *rdp;
 	struct rcu_node *rnp = rcu_get_root();
+	bool start_new_poll;
 
 	WRITE_ONCE(rcu_state.gp_activity, jiffies);
 	raw_spin_lock_irq_rcu_node(rnp);
@@ -1517,11 +1709,15 @@  static noinline_for_stack bool rcu_gp_init(void)
 	/* Record GP times before starting GP, hence rcu_seq_start(). */
 	rcu_seq_start(&rcu_state.gp_seq);
 	ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
-	rcu_sr_normal_gp_init();
+	start_new_poll = rcu_sr_normal_gp_init();
 	trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
 	rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
 	raw_spin_unlock_irq_rcu_node(rnp);
 
+	// New poll request after rnp unlock
+	if (start_new_poll)
+		(void) start_poll_synchronize_rcu();
+
 	/*
 	 * Apply per-leaf buffered online and offline operations to
 	 * the rcu_node tree. Note that this new grace period need not