Message ID | 20220922220104.2446868-2-joel@joelfernandes.org (mailing list archive) |
---|---|
State | Superseded |
Series | rcu: call_rcu() power improvements |
On Thu, Sep 22, 2022 at 10:01:01PM +0000, Joel Fernandes (Google) wrote: > Implement timer-based RCU lazy callback batching. The batch is flushed > whenever a certain amount of time has passed, or the batch on a > particular CPU grows too big. Also memory pressure will flush it in a > future patch. > > To handle several corner cases automagically (such as rcu_barrier() and > hotplug), we re-use bypass lists to handle lazy CBs. The bypass list > length has the lazy CB length included in it. A separate lazy CB length > counter is also introduced to keep track of the number of lazy CBs. > > v5->v6: > > [ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other > deferral levels wake much earlier so for those it is not needed. ] > > [ Frederic Weisbec: Use flush flags to keep bypass API code clean. ] > > [ Frederic Weisbec: Make rcu_barrier() wake up only if main list empty. ] > > [ Frederic Weisbec: Remove extra 'else if' branch in rcu_nocb_try_bypass(). ] > > [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ] > > [ Paul/Thomas/Joel: Make call_rcu() default lazy so users don't mess up. ] > > Suggested-by: Paul McKenney <paulmck@kernel.org> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> I am going to put these on a local branch for testing, but a few comments and questions interspersed below. Thanx, Paul > --- > include/linux/rcupdate.h | 7 ++ > kernel/rcu/Kconfig | 8 ++ > kernel/rcu/rcu.h | 8 ++ > kernel/rcu/tiny.c | 2 +- > kernel/rcu/tree.c | 133 ++++++++++++++++++---------- > kernel/rcu/tree.h | 17 +++- > kernel/rcu/tree_exp.h | 2 +- > kernel/rcu/tree_nocb.h | 184 ++++++++++++++++++++++++++++++++------- > 8 files changed, 277 insertions(+), 84 deletions(-) > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > index 08605ce7379d..40ae36904825 100644 > --- a/include/linux/rcupdate.h > +++ b/include/linux/rcupdate.h > @@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void) > > #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ > > +#ifdef CONFIG_RCU_LAZY > +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func); > +#else > +static inline void call_rcu_flush(struct rcu_head *head, > + rcu_callback_t func) { call_rcu(head, func); } > +#endif > + > /* Internal to kernel */ > void rcu_init(void); > extern int rcu_scheduler_active; > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > index f53ad63b2bc6..edd632e68497 100644 > --- a/kernel/rcu/Kconfig > +++ b/kernel/rcu/Kconfig > @@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB > Say N here if you hate read-side memory barriers. > Take the default if you are unsure. > > +config RCU_LAZY > + bool "RCU callback lazy invocation functionality" > + depends on RCU_NOCB_CPU > + default n > + help > + To save power, batch RCU callbacks and flush after delay, memory > + pressure or callback list growing too big. 
> + > endmenu # "RCU Subsystem" > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > index be5979da07f5..65704cbc9df7 100644 > --- a/kernel/rcu/rcu.h > +++ b/kernel/rcu/rcu.h > @@ -474,6 +474,14 @@ enum rcutorture_type { > INVALID_RCU_FLAVOR > }; > > +#if defined(CONFIG_RCU_LAZY) > +unsigned long rcu_lazy_get_jiffies_till_flush(void); > +void rcu_lazy_set_jiffies_till_flush(unsigned long j); > +#else > +static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; } > +static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { } > +#endif > + > #if defined(CONFIG_TREE_RCU) > void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags, > unsigned long *gp_seq); > diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c > index a33a8d4942c3..810479cf17ba 100644 > --- a/kernel/rcu/tiny.c > +++ b/kernel/rcu/tiny.c > @@ -44,7 +44,7 @@ static struct rcu_ctrlblk rcu_ctrlblk = { > > void rcu_barrier(void) > { > - wait_rcu_gp(call_rcu); > + wait_rcu_gp(call_rcu_flush); > } > EXPORT_SYMBOL(rcu_barrier); > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 5ec97e3f7468..736d0d724207 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp) > raw_spin_unlock_rcu_node(rnp); > } > > -/** > - * call_rcu() - Queue an RCU callback for invocation after a grace period. > - * @head: structure to be used for queueing the RCU updates. > - * @func: actual callback function to be invoked after the grace period > - * > - * The callback function will be invoked some time after a full grace > - * period elapses, in other words after all pre-existing RCU read-side > - * critical sections have completed. However, the callback function > - * might well execute concurrently with RCU read-side critical sections > - * that started after call_rcu() was invoked. > - * > - * RCU read-side critical sections are delimited by rcu_read_lock() > - * and rcu_read_unlock(), and may be nested. In addition, but only in > - * v5.0 and later, regions of code across which interrupts, preemption, > - * or softirqs have been disabled also serve as RCU read-side critical > - * sections. This includes hardware interrupt handlers, softirq handlers, > - * and NMI handlers. > - * > - * Note that all CPUs must agree that the grace period extended beyond > - * all pre-existing RCU read-side critical section. On systems with more > - * than one CPU, this means that when "func()" is invoked, each CPU is > - * guaranteed to have executed a full memory barrier since the end of its > - * last RCU read-side critical section whose beginning preceded the call > - * to call_rcu(). It also means that each CPU executing an RCU read-side > - * critical section that continues beyond the start of "func()" must have > - * executed a memory barrier after the call_rcu() but before the beginning > - * of that RCU read-side critical section. Note that these guarantees > - * include CPUs that are offline, idle, or executing in user mode, as > - * well as CPUs that are executing in the kernel. > - * > - * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the > - * resulting RCU callback function "func()", then both CPU A and CPU B are > - * guaranteed to execute a full memory barrier during the time interval > - * between the call to call_rcu() and the invocation of "func()" -- even > - * if CPU A and CPU B are the same CPU (but again only if the system has > - * more than one CPU). 
> - * > - * Implementation of these memory-ordering guarantees is described here: > - * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst. > - */ > -void call_rcu(struct rcu_head *head, rcu_callback_t func) > +static void > +__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy) > { > static atomic_t doublefrees; > unsigned long flags; > @@ -2809,7 +2770,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func) > } > > check_cb_ovld(rdp); > - if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags)) > + if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy)) > return; // Enqueued onto ->nocb_bypass, so just leave. > // If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock. > rcu_segcblist_enqueue(&rdp->cblist, head); > @@ -2831,8 +2792,84 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func) > local_irq_restore(flags); > } > } > -EXPORT_SYMBOL_GPL(call_rcu); > > +#ifdef CONFIG_RCU_LAZY > +/** > + * call_rcu_flush() - Queue RCU callback for invocation after grace period, and > + * flush all lazy callbacks (including the new one) to the main ->cblist while > + * doing so. > + * > + * @head: structure to be used for queueing the RCU updates. > + * @func: actual callback function to be invoked after the grace period > + * > + * The callback function will be invoked some time after a full grace > + * period elapses, in other words after all pre-existing RCU read-side > + * critical sections have completed. > + * > + * Use this API instead of call_rcu() if you don't mind the callback being > + * invoked after very long periods of time on systems without memory pressure > + * and on systems which are lightly loaded or mostly idle. This comment is backwards, right? Shouldn't it say something like "Use this API instead of call_rcu() if you don't mind burning extra power..."? > + * > + * Other than the extra delay in callbacks being invoked, this function is > + * identical to, and reuses call_rcu()'s logic. Refer to call_rcu() for more > + * details about memory ordering and other functionality. > + */ > +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func) > +{ > + return __call_rcu_common(head, func, false); > +} > +EXPORT_SYMBOL_GPL(call_rcu_flush); > +#endif > + > +/** > + * call_rcu() - Queue an RCU callback for invocation after a grace period. > + * By default the callbacks are 'lazy' and are kept hidden from the main > + * ->cblist to prevent starting of grace periods too soon. > + * If you desire grace periods to start very soon, use call_rcu_flush(). > + * > + * @head: structure to be used for queueing the RCU updates. > + * @func: actual callback function to be invoked after the grace period > + * > + * The callback function will be invoked some time after a full grace > + * period elapses, in other words after all pre-existing RCU read-side > + * critical sections have completed. However, the callback function > + * might well execute concurrently with RCU read-side critical sections > + * that started after call_rcu() was invoked. > + * > + * RCU read-side critical sections are delimited by rcu_read_lock() > + * and rcu_read_unlock(), and may be nested. In addition, but only in > + * v5.0 and later, regions of code across which interrupts, preemption, > + * or softirqs have been disabled also serve as RCU read-side critical > + * sections. This includes hardware interrupt handlers, softirq handlers, > + * and NMI handlers. 
> + * > + * Note that all CPUs must agree that the grace period extended beyond > + * all pre-existing RCU read-side critical section. On systems with more > + * than one CPU, this means that when "func()" is invoked, each CPU is > + * guaranteed to have executed a full memory barrier since the end of its > + * last RCU read-side critical section whose beginning preceded the call > + * to call_rcu(). It also means that each CPU executing an RCU read-side > + * critical section that continues beyond the start of "func()" must have > + * executed a memory barrier after the call_rcu() but before the beginning > + * of that RCU read-side critical section. Note that these guarantees > + * include CPUs that are offline, idle, or executing in user mode, as > + * well as CPUs that are executing in the kernel. > + * > + * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the > + * resulting RCU callback function "func()", then both CPU A and CPU B are > + * guaranteed to execute a full memory barrier during the time interval > + * between the call to call_rcu() and the invocation of "func()" -- even > + * if CPU A and CPU B are the same CPU (but again only if the system has > + * more than one CPU). > + * > + * Implementation of these memory-ordering guarantees is described here: > + * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst. > + */ > +void call_rcu(struct rcu_head *head, rcu_callback_t func) > +{ > + return __call_rcu_common(head, func, true); > +} > +EXPORT_SYMBOL_GPL(call_rcu); > > /* Maximum number of jiffies to wait before draining a batch. */ > #define KFREE_DRAIN_JIFFIES (5 * HZ) > @@ -3507,7 +3544,7 @@ void synchronize_rcu(void) > if (rcu_gp_is_expedited()) > synchronize_rcu_expedited(); > else > - wait_rcu_gp(call_rcu); > + wait_rcu_gp(call_rcu_flush); > return; > } > > @@ -3902,7 +3939,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) > rdp->barrier_head.func = rcu_barrier_callback; > debug_rcu_head_queue(&rdp->barrier_head); > rcu_nocb_lock(rdp); > - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); > + /* > + * Flush the bypass list, but also wake up the GP thread as otherwise > + * bypass/lazy CBs maynot be noticed, and can cause real long delays! > + */ > + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE)); > if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) { > atomic_inc(&rcu_state.barrier_cpu_count); > } else { > @@ -4323,7 +4364,7 @@ void rcutree_migrate_callbacks(int cpu) > my_rdp = this_cpu_ptr(&rcu_data); > my_rnp = my_rdp->mynode; > rcu_nocb_lock(my_rdp); /* irqs already disabled. */ > - WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies)); > + WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, FLUSH_BP_NONE)); > raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */ > /* Leverage recent GPs and set GP for new callbacks. */ > needwake = rcu_advance_cbs(my_rnp, rdp) || > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h > index d4a97e40ea9c..361c41d642c7 100644 > --- a/kernel/rcu/tree.h > +++ b/kernel/rcu/tree.h > @@ -263,14 +263,16 @@ struct rcu_data { > unsigned long last_fqs_resched; /* Time of last rcu_resched(). */ > unsigned long last_sched_clock; /* Jiffies of last rcu_sched_clock_irq(). */ > > + long lazy_len; /* Length of buffered lazy callbacks. */ Do we ever actually care about the length as opposed to whether or not all the bypass callbacks are lazy? 
If not, a "some_nonlazy" boolean would be initialed to zero and ORed with the non-laziness of the added callback. Or, if there was a test anyway, simply set to 1 in the presence of a non-lazy callback. And as now, gets zeroed when the bypass is flushed. This might shorten a few lines of code. > int cpu; > }; > > /* Values for nocb_defer_wakeup field in struct rcu_data. */ > #define RCU_NOCB_WAKE_NOT 0 > #define RCU_NOCB_WAKE_BYPASS 1 > -#define RCU_NOCB_WAKE 2 > -#define RCU_NOCB_WAKE_FORCE 3 > +#define RCU_NOCB_WAKE_LAZY 2 > +#define RCU_NOCB_WAKE 3 > +#define RCU_NOCB_WAKE_FORCE 4 > > #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500)) > /* For jiffies_till_first_fqs and */ > @@ -439,10 +441,17 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp); > static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp); > static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq); > static void rcu_init_one_nocb(struct rcu_node *rnp); > + > +#define FLUSH_BP_NONE 0 > +/* Is the CB being enqueued after the flush, a lazy CB? */ > +#define FLUSH_BP_LAZY BIT(0) > +/* Wake up nocb-GP thread after flush? */ > +#define FLUSH_BP_WAKE BIT(1) > static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > - unsigned long j); > + unsigned long j, unsigned long flush_flags); > static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > - bool *was_alldone, unsigned long flags); > + bool *was_alldone, unsigned long flags, > + bool lazy); > static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty, > unsigned long flags); > static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level); > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > index 18e9b4cd78ef..5cac05600798 100644 > --- a/kernel/rcu/tree_exp.h > +++ b/kernel/rcu/tree_exp.h > @@ -937,7 +937,7 @@ void synchronize_rcu_expedited(void) > > /* If expedited grace periods are prohibited, fall back to normal. */ > if (rcu_gp_is_normal()) { > - wait_rcu_gp(call_rcu); > + wait_rcu_gp(call_rcu_flush); > return; > } > > diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h > index f77a6d7e1356..661c685aba3f 100644 > --- a/kernel/rcu/tree_nocb.h > +++ b/kernel/rcu/tree_nocb.h > @@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force) > return __wake_nocb_gp(rdp_gp, rdp, force, flags); > } > > +/* > + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that > + * can elapse before lazy callbacks are flushed. Lazy callbacks > + * could be flushed much earlier for a number of other reasons > + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are > + * left unsubmitted to RCU after those many jiffies. > + */ > +#define LAZY_FLUSH_JIFFIES (10 * HZ) > +static unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES; > + > +#ifdef CONFIG_RCU_LAZY > +// To be called only from test code. > +void rcu_lazy_set_jiffies_till_flush(unsigned long jif) > +{ > + jiffies_till_flush = jif; > +} > +EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush); > + > +unsigned long rcu_lazy_get_jiffies_till_flush(void) > +{ > + return jiffies_till_flush; > +} > +EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush); > +#endif > + > /* > * Arrange to wake the GP kthread for this NOCB group at some future > * time when it is safe to do so. > @@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, > raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags); > > /* > - * Bypass wakeup overrides previous deferments. 
In case > - * of callback storm, no need to wake up too early. > + * Bypass wakeup overrides previous deferments. In case of > + * callback storm, no need to wake up too early. > */ > - if (waketype == RCU_NOCB_WAKE_BYPASS) { > + if (waketype == RCU_NOCB_WAKE_LAZY > + && READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) { Please leave the "&&" on the previous line and line up the "READ_ONCE(" with the "waketype". That makes it easier to tell the condition from the following code. > + mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush); > + WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype); > + } else if (waketype == RCU_NOCB_WAKE_BYPASS) { > mod_timer(&rdp_gp->nocb_timer, jiffies + 2); > WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype); > } else { > @@ -293,12 +322,16 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, > * proves to be initially empty, just return false because the no-CB GP > * kthread may need to be awakened in this case. > * > + * Return true if there was something to be flushed and it succeeded, otherwise > + * false. > + * > * Note that this function always returns true if rhp is NULL. > */ > static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > - unsigned long j) > + unsigned long j, unsigned long flush_flags) > { > struct rcu_cblist rcl; > + bool lazy = flush_flags & FLUSH_BP_LAZY; > > WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp)); > rcu_lockdep_assert_cblist_protected(rdp); > @@ -310,7 +343,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > /* Note: ->cblist.len already accounts for ->nocb_bypass contents. */ > if (rhp) > rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */ > - rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > + > + /* > + * If the new CB requested was a lazy one, queue it onto the main > + * ->cblist so we can take advantage of a sooner grade period. "take advantage of a grace period that will happen regardless."? > + */ > + if (lazy && rhp) { > + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL); > + rcu_cblist_enqueue(&rcl, rhp); Would it makes sense to enqueue rhp onto ->nocb_bypass first, NULL out rhp, then let the rcu_cblist_flush_enqueue() be common code? Or did this function grow a later use of rhp that I missed? > + WRITE_ONCE(rdp->lazy_len, 0); > + } else { > + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > + WRITE_ONCE(rdp->lazy_len, 0); This WRITE_ONCE() can be dropped out of the "if" statement, correct? If so, this could be an "if" statement with two statements in its "then" clause, no "else" clause, and two statements following the "if" statement. > + } > + > rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl); > WRITE_ONCE(rdp->nocb_bypass_first, j); > rcu_nocb_bypass_unlock(rdp); > @@ -326,13 +372,33 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > * Note that this function always returns true if rhp is NULL. > */ > static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > - unsigned long j) > + unsigned long j, unsigned long flush_flags) > { > + bool ret; > + bool was_alldone = false; > + bool bypass_all_lazy = false; > + long bypass_ncbs; Alphabetical order by variable name, please. (Yes, I know that this is strange, but what can I say?) 
> + > if (!rcu_rdp_is_offloaded(rdp)) > return true; > rcu_lockdep_assert_cblist_protected(rdp); > rcu_nocb_bypass_lock(rdp); > - return rcu_nocb_do_flush_bypass(rdp, rhp, j); > + > + if (flush_flags & FLUSH_BP_WAKE) { > + was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); > + bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); > + bypass_all_lazy = bypass_ncbs && (bypass_ncbs == rdp->lazy_len); > + } > + > + ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags); > + > + // Wake up the nocb GP thread if needed. GP thread could be sleeping > + // while waiting for lazy timer to expire (otherwise rcu_barrier may > + // end up waiting for the duration of the lazy timer). > + if (flush_flags & FLUSH_BP_WAKE && was_alldone && bypass_all_lazy) > + wake_nocb_gp(rdp, false); > + > + return ret; > } > > /* > @@ -345,7 +411,7 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j) > if (!rcu_rdp_is_offloaded(rdp) || > !rcu_nocb_bypass_trylock(rdp)) > return; > - WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j)); > + WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j, FLUSH_BP_NONE)); > } > > /* > @@ -367,12 +433,14 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j) > * there is only one CPU in operation. > */ > static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > - bool *was_alldone, unsigned long flags) > + bool *was_alldone, unsigned long flags, > + bool lazy) > { > unsigned long c; > unsigned long cur_gp_seq; > unsigned long j = jiffies; > long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); > + bool bypass_is_lazy = (ncbs == READ_ONCE(rdp->lazy_len)); > > lockdep_assert_irqs_disabled(); > > @@ -417,25 +485,30 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > // If there hasn't yet been all that many ->cblist enqueues > // this jiffy, tell the caller to enqueue onto ->cblist. But flush > // ->nocb_bypass first. > - if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) { > + // Lazy CBs throttle this back and do immediate bypass queuing. > + if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) { > rcu_nocb_lock(rdp); > *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); > if (*was_alldone) > trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, > TPS("FirstQ")); > - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j)); > + > + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, FLUSH_BP_NONE)); > WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass)); > return false; // Caller must enqueue the callback. > } > > // If ->nocb_bypass has been used too long or is too full, > // flush ->nocb_bypass to ->cblist. > - if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) || > + if ((ncbs && !bypass_is_lazy && j != READ_ONCE(rdp->nocb_bypass_first)) || > + (ncbs && bypass_is_lazy && > + (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush))) || > ncbs >= qhimark) { > rcu_nocb_lock(rdp); > *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); > > - if (!rcu_nocb_flush_bypass(rdp, rhp, j)) { > + if (!rcu_nocb_flush_bypass(rdp, rhp, j, > + lazy ? FLUSH_BP_LAZY : FLUSH_BP_NONE)) { > if (*was_alldone) > trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, > TPS("FirstQ")); > @@ -460,16 +533,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > // We need to use the bypass. > rcu_nocb_wait_contended(rdp); > rcu_nocb_bypass_lock(rdp); > + > ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); > rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. 
*/ > rcu_cblist_enqueue(&rdp->nocb_bypass, rhp); > + > + if (IS_ENABLED(CONFIG_RCU_LAZY) && lazy) Won't !IS_ENABLED(CONFIG_RCU_LAZY) mean that lazy cannot be set? Why do we need to check both? Or are you going for dead code? If so, shouldn't there be IS_ENABLED(CONFIG_RCU_LAZY) checks above as well? Me, I am not convinced that the dead code would buy you much. In fact, the compiler might well be propagating the constants on its own. Ah! The reason the compiler cannot figure this out is because you put the switch into rcu.h. If you instead always export the call_rcu_flush() definition, and check IS_ENABLED(CONFIG_RCU_LAZY) at the beginning of call_rcu(), the compiler should have the information that it needs to do this for you. > + WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1); > + > if (!ncbs) { > WRITE_ONCE(rdp->nocb_bypass_first, j); > trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ")); > } > + > rcu_nocb_bypass_unlock(rdp); > smp_mb(); /* Order enqueue before wake. */ > - if (ncbs) { > + > + // A wake up of the grace period kthread or timer adjustment needs to > + // be done only if: > + // 1. Bypass list was fully empty before (this is the first bypass list entry). > + // Or, both the below conditions are met: > + // 1. Bypass list had only lazy CBs before. > + // 2. The new CB is non-lazy. > + if (ncbs && (!bypass_is_lazy || lazy)) { > local_irq_restore(flags); > } else { > // No-CBs GP kthread might be indefinitely asleep, if so, wake. > @@ -499,7 +585,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, > { > unsigned long cur_gp_seq; > unsigned long j; > - long len; > + long len, lazy_len, bypass_len; > struct task_struct *t; Again, alphabetical please, strange though that might seem. > // If we are being polled or there is no kthread, just leave. > @@ -512,9 +598,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, > } > // Need to actually to a wakeup. > len = rcu_segcblist_n_cbs(&rdp->cblist); > + bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass); > + lazy_len = READ_ONCE(rdp->lazy_len); > if (was_alldone) { > rdp->qlen_last_fqs_check = len; > - if (!irqs_disabled_flags(flags)) { > + // Only lazy CBs in bypass list > + if (lazy_len && bypass_len == lazy_len) { > + rcu_nocb_unlock_irqrestore(rdp, flags); > + wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY, > + TPS("WakeLazy")); > + } else if (!irqs_disabled_flags(flags)) { > /* ... if queue was empty ... */ > rcu_nocb_unlock_irqrestore(rdp, flags); > wake_nocb_gp(rdp, false); > @@ -604,8 +697,8 @@ static void nocb_gp_sleep(struct rcu_data *my_rdp, int cpu) > */ > static void nocb_gp_wait(struct rcu_data *my_rdp) > { > - bool bypass = false; > - long bypass_ncbs; > + bool bypass = false, lazy = false; > + long bypass_ncbs, lazy_ncbs; And ditto. > int __maybe_unused cpu = my_rdp->cpu; > unsigned long cur_gp_seq; > unsigned long flags; > @@ -640,24 +733,41 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) > * won't be ignored for long. 
> */ > list_for_each_entry(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp) { > + bool flush_bypass = false; > + > trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check")); > rcu_nocb_lock_irqsave(rdp, flags); > lockdep_assert_held(&rdp->nocb_lock); > bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); > - if (bypass_ncbs && > + lazy_ncbs = READ_ONCE(rdp->lazy_len); > + > + if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) && > + (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) || > + bypass_ncbs > 2 * qhimark)) { > + flush_bypass = true; > + } else if (bypass_ncbs && (lazy_ncbs != bypass_ncbs) && > (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) || > bypass_ncbs > 2 * qhimark)) { > - // Bypass full or old, so flush it. > - (void)rcu_nocb_try_flush_bypass(rdp, j); > - bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); > + flush_bypass = true; > } else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) { > rcu_nocb_unlock_irqrestore(rdp, flags); > continue; /* No callbacks here, try next. */ > } > + > + if (flush_bypass) { > + // Bypass full or old, so flush it. > + (void)rcu_nocb_try_flush_bypass(rdp, j); > + bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); > + lazy_ncbs = READ_ONCE(rdp->lazy_len); > + } > + > if (bypass_ncbs) { > trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, > - TPS("Bypass")); > - bypass = true; > + bypass_ncbs == lazy_ncbs ? TPS("Lazy") : TPS("Bypass")); > + if (bypass_ncbs == lazy_ncbs) > + lazy = true; > + else > + bypass = true; > } > rnp = rdp->mynode; > > @@ -705,12 +815,21 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) > my_rdp->nocb_gp_gp = needwait_gp; > my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0; > > - if (bypass && !rcu_nocb_poll) { > - // At least one child with non-empty ->nocb_bypass, so set > - // timer in order to avoid stranding its callbacks. > - wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS, > - TPS("WakeBypassIsDeferred")); > + // At least one child with non-empty ->nocb_bypass, so set > + // timer in order to avoid stranding its callbacks. > + if (!rcu_nocb_poll) { > + // If bypass list only has lazy CBs. Add a deferred > + // lazy wake up. One sentence rather than two. > + if (lazy && !bypass) { > + wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY, > + TPS("WakeLazyIsDeferred")); > + // Otherwise add a deferred bypass wake up. > + } else if (bypass) { > + wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS, > + TPS("WakeBypassIsDeferred")); > + } > } > + > if (rcu_nocb_poll) { > /* Polling, so trace if first poll in the series. */ > if (gotcbs) > @@ -1036,7 +1155,7 @@ static long rcu_nocb_rdp_deoffload(void *arg) > * return false, which means that future calls to rcu_nocb_try_bypass() > * will refuse to put anything into the bypass. > */ > - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); > + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_NONE)); > /* > * Start with invoking rcu_core() early. 
This way if the current thread > * happens to preempt an ongoing call to rcu_core() in the middle, > @@ -1278,6 +1397,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) > raw_spin_lock_init(&rdp->nocb_gp_lock); > timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0); > rcu_cblist_init(&rdp->nocb_bypass); > + WRITE_ONCE(rdp->lazy_len, 0); > mutex_init(&rdp->nocb_gp_kthread_mutex); > } > > @@ -1559,13 +1679,13 @@ static void rcu_init_one_nocb(struct rcu_node *rnp) > } > > static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > - unsigned long j) > + unsigned long j, unsigned long flush_flags) > { > return true; > } > > static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > - bool *was_alldone, unsigned long flags) > + bool *was_alldone, unsigned long flags, bool lazy) > { > return false; > } > -- > 2.37.3.998.g577e59143f-goog >
Hi Paul, Let’s see whether my iPhone can handle replies ;-) More below: > On Sep 23, 2022, at 5:44 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > On Thu, Sep 22, 2022 at 10:01:01PM +0000, Joel Fernandes (Google) wrote: >> Implement timer-based RCU lazy callback batching. The batch is flushed >> whenever a certain amount of time has passed, or the batch on a >> particular CPU grows too big. Also memory pressure will flush it in a >> future patch. >> >> To handle several corner cases automagically (such as rcu_barrier() and >> hotplug), we re-use bypass lists to handle lazy CBs. The bypass list >> length has the lazy CB length included in it. A separate lazy CB length >> counter is also introduced to keep track of the number of lazy CBs. >> >> v5->v6: >> >> [ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other >> deferral levels wake much earlier so for those it is not needed. ] >> >> [ Frederic Weisbec: Use flush flags to keep bypass API code clean. ] >> >> [ Frederic Weisbec: Make rcu_barrier() wake up only if main list empty. ] >> >> [ Frederic Weisbec: Remove extra 'else if' branch in rcu_nocb_try_bypass(). ] >> >> [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ] >> >> [ Paul/Thomas/Joel: Make call_rcu() default lazy so users don't mess up. ] >> >> Suggested-by: Paul McKenney <paulmck@kernel.org> >> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > I am going to put these on a local branch for testing, but a few comments > and questions interspersed below. > > Thanx, Paul Ok I replied below, thank you! >> include/linux/rcupdate.h | 7 ++ >> kernel/rcu/Kconfig | 8 ++ >> kernel/rcu/rcu.h | 8 ++ >> kernel/rcu/tiny.c | 2 +- >> kernel/rcu/tree.c | 133 ++++++++++++++++++---------- >> kernel/rcu/tree.h | 17 +++- >> kernel/rcu/tree_exp.h | 2 +- >> kernel/rcu/tree_nocb.h | 184 ++++++++++++++++++++++++++++++++------- >> 8 files changed, 277 insertions(+), 84 deletions(-) >> >> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h >> index 08605ce7379d..40ae36904825 100644 >> --- a/include/linux/rcupdate.h >> +++ b/include/linux/rcupdate.h >> @@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void) >> >> #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ >> >> +#ifdef CONFIG_RCU_LAZY >> +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func); >> +#else >> +static inline void call_rcu_flush(struct rcu_head *head, >> + rcu_callback_t func) { call_rcu(head, func); } >> +#endif >> + >> /* Internal to kernel */ >> void rcu_init(void); >> extern int rcu_scheduler_active; >> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig >> index f53ad63b2bc6..edd632e68497 100644 >> --- a/kernel/rcu/Kconfig >> +++ b/kernel/rcu/Kconfig >> @@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB >> Say N here if you hate read-side memory barriers. >> Take the default if you are unsure. >> >> +config RCU_LAZY >> + bool "RCU callback lazy invocation functionality" >> + depends on RCU_NOCB_CPU >> + default n >> + help >> + To save power, batch RCU callbacks and flush after delay, memory >> + pressure or callback list growing too big. 
>> + >> endmenu # "RCU Subsystem" >> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h >> index be5979da07f5..65704cbc9df7 100644 >> --- a/kernel/rcu/rcu.h >> +++ b/kernel/rcu/rcu.h >> @@ -474,6 +474,14 @@ enum rcutorture_type { >> INVALID_RCU_FLAVOR >> }; >> >> +#if defined(CONFIG_RCU_LAZY) >> +unsigned long rcu_lazy_get_jiffies_till_flush(void); >> +void rcu_lazy_set_jiffies_till_flush(unsigned long j); >> +#else >> +static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; } >> +static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { } >> +#endif >> + >> #if defined(CONFIG_TREE_RCU) >> void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags, >> unsigned long *gp_seq); >> diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c >> index a33a8d4942c3..810479cf17ba 100644 >> --- a/kernel/rcu/tiny.c >> +++ b/kernel/rcu/tiny.c >> @@ -44,7 +44,7 @@ static struct rcu_ctrlblk rcu_ctrlblk = { >> >> void rcu_barrier(void) >> { >> - wait_rcu_gp(call_rcu); >> + wait_rcu_gp(call_rcu_flush); >> } >> EXPORT_SYMBOL(rcu_barrier); >> >> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c >> index 5ec97e3f7468..736d0d724207 100644 >> --- a/kernel/rcu/tree.c >> +++ b/kernel/rcu/tree.c >> @@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp) >> raw_spin_unlock_rcu_node(rnp); >> } >> >> -/** >> - * call_rcu() - Queue an RCU callback for invocation after a grace period. >> - * @head: structure to be used for queueing the RCU updates. >> - * @func: actual callback function to be invoked after the grace period >> - * >> - * The callback function will be invoked some time after a full grace >> - * period elapses, in other words after all pre-existing RCU read-side >> - * critical sections have completed. However, the callback function >> - * might well execute concurrently with RCU read-side critical sections >> - * that started after call_rcu() was invoked. >> - * >> - * RCU read-side critical sections are delimited by rcu_read_lock() >> - * and rcu_read_unlock(), and may be nested. In addition, but only in >> - * v5.0 and later, regions of code across which interrupts, preemption, >> - * or softirqs have been disabled also serve as RCU read-side critical >> - * sections. This includes hardware interrupt handlers, softirq handlers, >> - * and NMI handlers. >> - * >> - * Note that all CPUs must agree that the grace period extended beyond >> - * all pre-existing RCU read-side critical section. On systems with more >> - * than one CPU, this means that when "func()" is invoked, each CPU is >> - * guaranteed to have executed a full memory barrier since the end of its >> - * last RCU read-side critical section whose beginning preceded the call >> - * to call_rcu(). It also means that each CPU executing an RCU read-side >> - * critical section that continues beyond the start of "func()" must have >> - * executed a memory barrier after the call_rcu() but before the beginning >> - * of that RCU read-side critical section. Note that these guarantees >> - * include CPUs that are offline, idle, or executing in user mode, as >> - * well as CPUs that are executing in the kernel. 
>> - * >> - * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the >> - * resulting RCU callback function "func()", then both CPU A and CPU B are >> - * guaranteed to execute a full memory barrier during the time interval >> - * between the call to call_rcu() and the invocation of "func()" -- even >> - * if CPU A and CPU B are the same CPU (but again only if the system has >> - * more than one CPU). >> - * >> - * Implementation of these memory-ordering guarantees is described here: >> - * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst. >> - */ >> -void call_rcu(struct rcu_head *head, rcu_callback_t func) >> +static void >> +__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy) >> { >> static atomic_t doublefrees; >> unsigned long flags; >> @@ -2809,7 +2770,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func) >> } >> >> check_cb_ovld(rdp); >> - if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags)) >> + if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy)) >> return; // Enqueued onto ->nocb_bypass, so just leave. >> // If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock. >> rcu_segcblist_enqueue(&rdp->cblist, head); >> @@ -2831,8 +2792,84 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func) >> local_irq_restore(flags); >> } >> } >> -EXPORT_SYMBOL_GPL(call_rcu); >> >> +#ifdef CONFIG_RCU_LAZY >> +/** >> + * call_rcu_flush() - Queue RCU callback for invocation after grace period, and >> + * flush all lazy callbacks (including the new one) to the main ->cblist while >> + * doing so. >> + * >> + * @head: structure to be used for queueing the RCU updates. >> + * @func: actual callback function to be invoked after the grace period >> + * >> + * The callback function will be invoked some time after a full grace >> + * period elapses, in other words after all pre-existing RCU read-side >> + * critical sections have completed. >> + * >> + * Use this API instead of call_rcu() if you don't mind the callback being >> + * invoked after very long periods of time on systems without memory pressure >> + * and on systems which are lightly loaded or mostly idle. > > This comment is backwards, right? Shouldn't it say something like "Use > this API instead of call_rcu() if you don't mind burning extra power..."? Yes sorry my mistake :-(. It’s a stale comment from the rework and I’ll update it. > >> + * >> + * Other than the extra delay in callbacks being invoked, this function is >> + * identical to, and reuses call_rcu()'s logic. Refer to call_rcu() for more >> + * details about memory ordering and other functionality. >> + */ >> +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func) >> +{ >> + return __call_rcu_common(head, func, false); >> +} >> +EXPORT_SYMBOL_GPL(call_rcu_flush); >> +#endif >> + >> +/** >> + * call_rcu() - Queue an RCU callback for invocation after a grace period. >> + * By default the callbacks are 'lazy' and are kept hidden from the main >> + * ->cblist to prevent starting of grace periods too soon. >> + * If you desire grace periods to start very soon, use call_rcu_flush(). >> + * >> + * @head: structure to be used for queueing the RCU updates. >> + * @func: actual callback function to be invoked after the grace period >> + * >> + * The callback function will be invoked some time after a full grace >> + * period elapses, in other words after all pre-existing RCU read-side >> + * critical sections have completed. 
However, the callback function >> + * might well execute concurrently with RCU read-side critical sections >> + * that started after call_rcu() was invoked. >> + * >> + * RCU read-side critical sections are delimited by rcu_read_lock() >> + * and rcu_read_unlock(), and may be nested. In addition, but only in >> + * v5.0 and later, regions of code across which interrupts, preemption, >> + * or softirqs have been disabled also serve as RCU read-side critical >> + * sections. This includes hardware interrupt handlers, softirq handlers, >> + * and NMI handlers. >> + * >> + * Note that all CPUs must agree that the grace period extended beyond >> + * all pre-existing RCU read-side critical section. On systems with more >> + * than one CPU, this means that when "func()" is invoked, each CPU is >> + * guaranteed to have executed a full memory barrier since the end of its >> + * last RCU read-side critical section whose beginning preceded the call >> + * to call_rcu(). It also means that each CPU executing an RCU read-side >> + * critical section that continues beyond the start of "func()" must have >> + * executed a memory barrier after the call_rcu() but before the beginning >> + * of that RCU read-side critical section. Note that these guarantees >> + * include CPUs that are offline, idle, or executing in user mode, as >> + * well as CPUs that are executing in the kernel. >> + * >> + * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the >> + * resulting RCU callback function "func()", then both CPU A and CPU B are >> + * guaranteed to execute a full memory barrier during the time interval >> + * between the call to call_rcu() and the invocation of "func()" -- even >> + * if CPU A and CPU B are the same CPU (but again only if the system has >> + * more than one CPU). >> + * >> + * Implementation of these memory-ordering guarantees is described here: >> + * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst. >> + */ >> +void call_rcu(struct rcu_head *head, rcu_callback_t func) >> +{ >> + return __call_rcu_common(head, func, true); >> +} >> +EXPORT_SYMBOL_GPL(call_rcu); >> >> /* Maximum number of jiffies to wait before draining a batch. */ >> #define KFREE_DRAIN_JIFFIES (5 * HZ) >> @@ -3507,7 +3544,7 @@ void synchronize_rcu(void) >> if (rcu_gp_is_expedited()) >> synchronize_rcu_expedited(); >> else >> - wait_rcu_gp(call_rcu); >> + wait_rcu_gp(call_rcu_flush); >> return; >> } >> >> @@ -3902,7 +3939,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) >> rdp->barrier_head.func = rcu_barrier_callback; >> debug_rcu_head_queue(&rdp->barrier_head); >> rcu_nocb_lock(rdp); >> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); >> + /* >> + * Flush the bypass list, but also wake up the GP thread as otherwise >> + * bypass/lazy CBs maynot be noticed, and can cause real long delays! >> + */ >> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE)); >> if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) { >> atomic_inc(&rcu_state.barrier_cpu_count); >> } else { >> @@ -4323,7 +4364,7 @@ void rcutree_migrate_callbacks(int cpu) >> my_rdp = this_cpu_ptr(&rcu_data); >> my_rnp = my_rdp->mynode; >> rcu_nocb_lock(my_rdp); /* irqs already disabled. */ >> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies)); >> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, FLUSH_BP_NONE)); >> raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */ >> /* Leverage recent GPs and set GP for new callbacks. 
*/ >> needwake = rcu_advance_cbs(my_rnp, rdp) || >> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h >> index d4a97e40ea9c..361c41d642c7 100644 >> --- a/kernel/rcu/tree.h >> +++ b/kernel/rcu/tree.h >> @@ -263,14 +263,16 @@ struct rcu_data { >> unsigned long last_fqs_resched; /* Time of last rcu_resched(). */ >> unsigned long last_sched_clock; /* Jiffies of last rcu_sched_clock_irq(). */ >> >> + long lazy_len; /* Length of buffered lazy callbacks. */ > > Do we ever actually care about the length as opposed to whether or not all > the bypass callbacks are lazy? If not, a "some_nonlazy" boolean would be > initialed to zero and ORed with the non-laziness of the added callback. > Or, if there was a test anyway, simply set to 1 in the presence of a > non-lazy callback. And as now, gets zeroed when the bypass is flushed. We had discussed this before, and my point was we could use it for tracing and future extension as well. If it’s ok with you, I prefer to keep it this way than having to come back add the length later. On a minor point as well, if you really want it this way, I am afraid of changing this now since I tested this way for last several iterations and it’s easy to add a regression. So I prefer to keep it this way and then I can add more patches later on top if that’s Ok with you. > > This might shorten a few lines of code. > >> int cpu; >> }; >> >> /* Values for nocb_defer_wakeup field in struct rcu_data. */ >> #define RCU_NOCB_WAKE_NOT 0 >> #define RCU_NOCB_WAKE_BYPASS 1 >> -#define RCU_NOCB_WAKE 2 >> -#define RCU_NOCB_WAKE_FORCE 3 >> +#define RCU_NOCB_WAKE_LAZY 2 >> +#define RCU_NOCB_WAKE 3 >> +#define RCU_NOCB_WAKE_FORCE 4 >> >> #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500)) >> /* For jiffies_till_first_fqs and */ >> @@ -439,10 +441,17 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp); >> static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp); >> static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq); >> static void rcu_init_one_nocb(struct rcu_node *rnp); >> + >> +#define FLUSH_BP_NONE 0 >> +/* Is the CB being enqueued after the flush, a lazy CB? */ >> +#define FLUSH_BP_LAZY BIT(0) >> +/* Wake up nocb-GP thread after flush? */ >> +#define FLUSH_BP_WAKE BIT(1) >> static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >> - unsigned long j); >> + unsigned long j, unsigned long flush_flags); >> static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >> - bool *was_alldone, unsigned long flags); >> + bool *was_alldone, unsigned long flags, >> + bool lazy); >> static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty, >> unsigned long flags); >> static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level); >> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h >> index 18e9b4cd78ef..5cac05600798 100644 >> --- a/kernel/rcu/tree_exp.h >> +++ b/kernel/rcu/tree_exp.h >> @@ -937,7 +937,7 @@ void synchronize_rcu_expedited(void) >> >> /* If expedited grace periods are prohibited, fall back to normal. 
*/ >> if (rcu_gp_is_normal()) { >> - wait_rcu_gp(call_rcu); >> + wait_rcu_gp(call_rcu_flush); >> return; >> } >> >> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h >> index f77a6d7e1356..661c685aba3f 100644 >> --- a/kernel/rcu/tree_nocb.h >> +++ b/kernel/rcu/tree_nocb.h >> @@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force) >> return __wake_nocb_gp(rdp_gp, rdp, force, flags); >> } >> >> +/* >> + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that >> + * can elapse before lazy callbacks are flushed. Lazy callbacks >> + * could be flushed much earlier for a number of other reasons >> + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are >> + * left unsubmitted to RCU after those many jiffies. >> + */ >> +#define LAZY_FLUSH_JIFFIES (10 * HZ) >> +static unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES; >> + >> +#ifdef CONFIG_RCU_LAZY >> +// To be called only from test code. >> +void rcu_lazy_set_jiffies_till_flush(unsigned long jif) >> +{ >> + jiffies_till_flush = jif; >> +} >> +EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush); >> + >> +unsigned long rcu_lazy_get_jiffies_till_flush(void) >> +{ >> + return jiffies_till_flush; >> +} >> +EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush); >> +#endif >> + >> /* >> * Arrange to wake the GP kthread for this NOCB group at some future >> * time when it is safe to do so. >> @@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, >> raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags); >> >> /* >> - * Bypass wakeup overrides previous deferments. In case >> - * of callback storm, no need to wake up too early. >> + * Bypass wakeup overrides previous deferments. In case of >> + * callback storm, no need to wake up too early. >> */ >> - if (waketype == RCU_NOCB_WAKE_BYPASS) { >> + if (waketype == RCU_NOCB_WAKE_LAZY >> + && READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) { > > Please leave the "&&" on the previous line and line up the "READ_ONCE(" > with the "waketype". That makes it easier to tell the condition from > the following code. Will do! > >> + mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush); >> + WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype); >> + } else if (waketype == RCU_NOCB_WAKE_BYPASS) { >> mod_timer(&rdp_gp->nocb_timer, jiffies + 2); >> WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype); >> } else { >> @@ -293,12 +322,16 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, >> * proves to be initially empty, just return false because the no-CB GP >> * kthread may need to be awakened in this case. >> * >> + * Return true if there was something to be flushed and it succeeded, otherwise >> + * false. >> + * >> * Note that this function always returns true if rhp is NULL. >> */ >> static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >> - unsigned long j) >> + unsigned long j, unsigned long flush_flags) >> { >> struct rcu_cblist rcl; >> + bool lazy = flush_flags & FLUSH_BP_LAZY; >> >> WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp)); >> rcu_lockdep_assert_cblist_protected(rdp); >> @@ -310,7 +343,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >> /* Note: ->cblist.len already accounts for ->nocb_bypass contents. */ >> if (rhp) >> rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. 
*/ >> - rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); >> + >> + /* >> + * If the new CB requested was a lazy one, queue it onto the main >> + * ->cblist so we can take advantage of a sooner grade period. > > "take advantage of a grace period that will happen regardless."? Sure will update. > >> + */ >> + if (lazy && rhp) { >> + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL); >> + rcu_cblist_enqueue(&rcl, rhp); > > Would it makes sense to enqueue rhp onto ->nocb_bypass first, NULL out > rhp, then let the rcu_cblist_flush_enqueue() be common code? Or did this > function grow a later use of rhp that I missed? No that could be done, but I prefer to keep it this way because rhp is a function parameter and I prefer not to modify those since it could add a bug in future where rhp passed by user is now NULL for some reason, half way through the function. > >> + WRITE_ONCE(rdp->lazy_len, 0); >> + } else { >> + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); >> + WRITE_ONCE(rdp->lazy_len, 0); > > This WRITE_ONCE() can be dropped out of the "if" statement, correct? Yes will update. > > If so, this could be an "if" statement with two statements in its "then" clause, no "else" clause, and two statements following the "if" statement. I don’t think we can get rid of the else part but I’ll see what it looks like. > >> + } >> + >> rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl); >> WRITE_ONCE(rdp->nocb_bypass_first, j); >> rcu_nocb_bypass_unlock(rdp); >> @@ -326,13 +372,33 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >> * Note that this function always returns true if rhp is NULL. >> */ >> static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >> - unsigned long j) >> + unsigned long j, unsigned long flush_flags) >> { >> + bool ret; >> + bool was_alldone = false; >> + bool bypass_all_lazy = false; >> + long bypass_ncbs; > > Alphabetical order by variable name, please. (Yes, I know that this is strange, but what can I say?) Sure. > >> + >> if (!rcu_rdp_is_offloaded(rdp)) >> return true; >> rcu_lockdep_assert_cblist_protected(rdp); >> rcu_nocb_bypass_lock(rdp); >> - return rcu_nocb_do_flush_bypass(rdp, rhp, j); >> + >> + if (flush_flags & FLUSH_BP_WAKE) { >> + was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); >> + bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); >> + bypass_all_lazy = bypass_ncbs && (bypass_ncbs == rdp->lazy_len); >> + } >> + >> + ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags); >> + >> + // Wake up the nocb GP thread if needed. GP thread could be sleeping >> + // while waiting for lazy timer to expire (otherwise rcu_barrier may >> + // end up waiting for the duration of the lazy timer). >> + if (flush_flags & FLUSH_BP_WAKE && was_alldone && bypass_all_lazy) >> + wake_nocb_gp(rdp, false); >> + >> + return ret; >> } >> >> /* >> @@ -345,7 +411,7 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j) >> if (!rcu_rdp_is_offloaded(rdp) || >> !rcu_nocb_bypass_trylock(rdp)) >> return; >> - WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j)); >> + WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j, FLUSH_BP_NONE)); >> } >> >> /* >> @@ -367,12 +433,14 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j) >> * there is only one CPU in operation. 
>> */ >> static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >> - bool *was_alldone, unsigned long flags) >> + bool *was_alldone, unsigned long flags, >> + bool lazy) >> { >> unsigned long c; >> unsigned long cur_gp_seq; >> unsigned long j = jiffies; >> long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); >> + bool bypass_is_lazy = (ncbs == READ_ONCE(rdp->lazy_len)); >> >> lockdep_assert_irqs_disabled(); >> >> @@ -417,25 +485,30 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >> // If there hasn't yet been all that many ->cblist enqueues >> // this jiffy, tell the caller to enqueue onto ->cblist. But flush >> // ->nocb_bypass first. >> - if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) { >> + // Lazy CBs throttle this back and do immediate bypass queuing. >> + if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) { >> rcu_nocb_lock(rdp); >> *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); >> if (*was_alldone) >> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, >> TPS("FirstQ")); >> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j)); >> + >> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, FLUSH_BP_NONE)); >> WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass)); >> return false; // Caller must enqueue the callback. >> } >> >> // If ->nocb_bypass has been used too long or is too full, >> // flush ->nocb_bypass to ->cblist. >> - if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) || >> + if ((ncbs && !bypass_is_lazy && j != READ_ONCE(rdp->nocb_bypass_first)) || >> + (ncbs && bypass_is_lazy && >> + (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush))) || >> ncbs >= qhimark) { >> rcu_nocb_lock(rdp); >> *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); >> >> - if (!rcu_nocb_flush_bypass(rdp, rhp, j)) { >> + if (!rcu_nocb_flush_bypass(rdp, rhp, j, >> + lazy ? FLUSH_BP_LAZY : FLUSH_BP_NONE)) { >> if (*was_alldone) >> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, >> TPS("FirstQ")); >> @@ -460,16 +533,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >> // We need to use the bypass. >> rcu_nocb_wait_contended(rdp); >> rcu_nocb_bypass_lock(rdp); >> + >> ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); >> rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */ >> rcu_cblist_enqueue(&rdp->nocb_bypass, rhp); >> + >> + if (IS_ENABLED(CONFIG_RCU_LAZY) && lazy) > > Won't !IS_ENABLED(CONFIG_RCU_LAZY) mean that lazy cannot be set? > Why do we need to check both? Or are you going for dead code? If so, > shouldn't there be IS_ENABLED(CONFIG_RCU_LAZY) checks above as well? > > Me, I am not convinced that the dead code would buy you much. In fact, > the compiler might well be propagating the constants on its own. > > Ah! The reason the compiler cannot figure this out is because you put > the switch into rcu.h. If you instead always export the call_rcu_flush() > definition, and check IS_ENABLED(CONFIG_RCU_LAZY) at the beginning of > call_rcu(), the compiler should have the information that it needs to > do this for you. Ah ok, I will try to do it this way. > >> + WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1); >> + >> if (!ncbs) { >> WRITE_ONCE(rdp->nocb_bypass_first, j); >> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ")); >> } >> + >> rcu_nocb_bypass_unlock(rdp); >> smp_mb(); /* Order enqueue before wake. */ >> - if (ncbs) { >> + >> + // A wake up of the grace period kthread or timer adjustment needs to >> + // be done only if: >> + // 1. 
Bypass list was fully empty before (this is the first bypass list entry). >> + // Or, both the below conditions are met: >> + // 1. Bypass list had only lazy CBs before. >> + // 2. The new CB is non-lazy. >> + if (ncbs && (!bypass_is_lazy || lazy)) { >> local_irq_restore(flags); >> } else { >> // No-CBs GP kthread might be indefinitely asleep, if so, wake. >> @@ -499,7 +585,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, >> { >> unsigned long cur_gp_seq; >> unsigned long j; >> - long len; >> + long len, lazy_len, bypass_len; >> struct task_struct *t; > > Again, alphabetical please, strange though that might seem. Yes sure > >> // If we are being polled or there is no kthread, just leave. >> @@ -512,9 +598,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, >> } >> // Need to actually to a wakeup. >> len = rcu_segcblist_n_cbs(&rdp->cblist); >> + bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass); >> + lazy_len = READ_ONCE(rdp->lazy_len); >> if (was_alldone) { >> rdp->qlen_last_fqs_check = len; >> - if (!irqs_disabled_flags(flags)) { >> + // Only lazy CBs in bypass list >> + if (lazy_len && bypass_len == lazy_len) { >> + rcu_nocb_unlock_irqrestore(rdp, flags); >> + wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY, >> + TPS("WakeLazy")); >> + } else if (!irqs_disabled_flags(flags)) { >> /* ... if queue was empty ... */ >> rcu_nocb_unlock_irqrestore(rdp, flags); >> wake_nocb_gp(rdp, false); >> @@ -604,8 +697,8 @@ static void nocb_gp_sleep(struct rcu_data *my_rdp, int cpu) >> */ >> static void nocb_gp_wait(struct rcu_data *my_rdp) >> { >> - bool bypass = false; >> - long bypass_ncbs; >> + bool bypass = false, lazy = false; >> + long bypass_ncbs, lazy_ncbs; > > And ditto. Ok > >> int __maybe_unused cpu = my_rdp->cpu; >> unsigned long cur_gp_seq; >> unsigned long flags; >> @@ -640,24 +733,41 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) >> * won't be ignored for long. >> */ >> list_for_each_entry(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp) { >> + bool flush_bypass = false; >> + >> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check")); >> rcu_nocb_lock_irqsave(rdp, flags); >> lockdep_assert_held(&rdp->nocb_lock); >> bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); >> - if (bypass_ncbs && >> + lazy_ncbs = READ_ONCE(rdp->lazy_len); >> + >> + if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) && >> + (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) || >> + bypass_ncbs > 2 * qhimark)) { >> + flush_bypass = true; >> + } else if (bypass_ncbs && (lazy_ncbs != bypass_ncbs) && >> (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) || >> bypass_ncbs > 2 * qhimark)) { >> - // Bypass full or old, so flush it. >> - (void)rcu_nocb_try_flush_bypass(rdp, j); >> - bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); >> + flush_bypass = true; >> } else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) { >> rcu_nocb_unlock_irqrestore(rdp, flags); >> continue; /* No callbacks here, try next. */ >> } >> + >> + if (flush_bypass) { >> + // Bypass full or old, so flush it. >> + (void)rcu_nocb_try_flush_bypass(rdp, j); >> + bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); >> + lazy_ncbs = READ_ONCE(rdp->lazy_len); >> + } >> + >> if (bypass_ncbs) { >> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, >> - TPS("Bypass")); >> - bypass = true; >> + bypass_ncbs == lazy_ncbs ? 
TPS("Lazy") : TPS("Bypass")); >> + if (bypass_ncbs == lazy_ncbs) >> + lazy = true; >> + else >> + bypass = true; >> } >> rnp = rdp->mynode; >> >> @@ -705,12 +815,21 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) >> my_rdp->nocb_gp_gp = needwait_gp; >> my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0; >> >> - if (bypass && !rcu_nocb_poll) { >> - // At least one child with non-empty ->nocb_bypass, so set >> - // timer in order to avoid stranding its callbacks. >> - wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS, >> - TPS("WakeBypassIsDeferred")); >> + // At least one child with non-empty ->nocb_bypass, so set >> + // timer in order to avoid stranding its callbacks. >> + if (!rcu_nocb_poll) { >> + // If bypass list only has lazy CBs. Add a deferred >> + // lazy wake up. > > One sentence rather than two. Ok > >> + if (lazy && !bypass) { >> + wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY, >> + TPS("WakeLazyIsDeferred")); >> + // Otherwise add a deferred bypass wake up. >> + } else if (bypass) { >> + wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS, >> + TPS("WakeBypassIsDeferred")); >> + } >> } >> + >> if (rcu_nocb_poll) { >> /* Polling, so trace if first poll in the series. */ >> if (gotcbs) >> @@ -1036,7 +1155,7 @@ static long rcu_nocb_rdp_deoffload(void *arg) >> * return false, which means that future calls to rcu_nocb_try_bypass() >> * will refuse to put anything into the bypass. >> */ >> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); >> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_NONE)); >> /* >> * Start with invoking rcu_core() early. This way if the current thread >> * happens to preempt an ongoing call to rcu_core() in the middle, >> @@ -1278,6 +1397,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) >> raw_spin_lock_init(&rdp->nocb_gp_lock); >> timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0); >> rcu_cblist_init(&rdp->nocb_bypass); >> + WRITE_ONCE(rdp->lazy_len, 0); >> mutex_init(&rdp->nocb_gp_kthread_mutex); >> } >> >> @@ -1559,13 +1679,13 @@ static void rcu_init_one_nocb(struct rcu_node *rnp) >> } >> >> static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >> - unsigned long j) >> + unsigned long j, unsigned long flush_flags) >> { >> return true; >> } >> >> static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >> - bool *was_alldone, unsigned long flags) >> + bool *was_alldone, unsigned long flags, bool lazy) >> { >> return false; >> } >> -- >> 2.37.3.998.g577e59143f-goog >>
On Sat, Sep 24, 2022 at 12:20:00PM -0400, Joel Fernandes wrote: > Hi Paul, > Let’s see whether my iPhone can handle replies ;-) Now let's see if it can handle replies to replies! Are you using a keyboard, or small-screen finger gestures? > More below: > > > On Sep 23, 2022, at 5:44 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Thu, Sep 22, 2022 at 10:01:01PM +0000, Joel Fernandes (Google) wrote: > >> Implement timer-based RCU lazy callback batching. The batch is flushed > >> whenever a certain amount of time has passed, or the batch on a > >> particular CPU grows too big. Also memory pressure will flush it in a > >> future patch. > >> > >> To handle several corner cases automagically (such as rcu_barrier() and > >> hotplug), we re-use bypass lists to handle lazy CBs. The bypass list > >> length has the lazy CB length included in it. A separate lazy CB length > >> counter is also introduced to keep track of the number of lazy CBs. > >> > >> v5->v6: > >> > >> [ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other > >> deferral levels wake much earlier so for those it is not needed. ] > >> > >> [ Frederic Weisbec: Use flush flags to keep bypass API code clean. ] > >> > >> [ Frederic Weisbec: Make rcu_barrier() wake up only if main list empty. ] > >> > >> [ Frederic Weisbec: Remove extra 'else if' branch in rcu_nocb_try_bypass(). ] > >> > >> [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ] > >> > >> [ Paul/Thomas/Joel: Make call_rcu() default lazy so users don't mess up. ] > >> > >> Suggested-by: Paul McKenney <paulmck@kernel.org> > >> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > I am going to put these on a local branch for testing, but a few comments > > and questions interspersed below. > > > > Thanx, Paul > > Ok I replied below, thank you! > > >> include/linux/rcupdate.h | 7 ++ > >> kernel/rcu/Kconfig | 8 ++ > >> kernel/rcu/rcu.h | 8 ++ > >> kernel/rcu/tiny.c | 2 +- > >> kernel/rcu/tree.c | 133 ++++++++++++++++++---------- > >> kernel/rcu/tree.h | 17 +++- > >> kernel/rcu/tree_exp.h | 2 +- > >> kernel/rcu/tree_nocb.h | 184 ++++++++++++++++++++++++++++++++------- > >> 8 files changed, 277 insertions(+), 84 deletions(-) > >> > >> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > >> index 08605ce7379d..40ae36904825 100644 > >> --- a/include/linux/rcupdate.h > >> +++ b/include/linux/rcupdate.h > >> @@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void) > >> > >> #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ > >> > >> +#ifdef CONFIG_RCU_LAZY > >> +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func); > >> +#else > >> +static inline void call_rcu_flush(struct rcu_head *head, > >> + rcu_callback_t func) { call_rcu(head, func); } > >> +#endif > >> + > >> /* Internal to kernel */ > >> void rcu_init(void); > >> extern int rcu_scheduler_active; > >> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > >> index f53ad63b2bc6..edd632e68497 100644 > >> --- a/kernel/rcu/Kconfig > >> +++ b/kernel/rcu/Kconfig > >> @@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB > >> Say N here if you hate read-side memory barriers. > >> Take the default if you are unsure. > >> > >> +config RCU_LAZY > >> + bool "RCU callback lazy invocation functionality" > >> + depends on RCU_NOCB_CPU > >> + default n > >> + help > >> + To save power, batch RCU callbacks and flush after delay, memory > >> + pressure or callback list growing too big. 
> >> + > >> endmenu # "RCU Subsystem" > >> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > >> index be5979da07f5..65704cbc9df7 100644 > >> --- a/kernel/rcu/rcu.h > >> +++ b/kernel/rcu/rcu.h > >> @@ -474,6 +474,14 @@ enum rcutorture_type { > >> INVALID_RCU_FLAVOR > >> }; > >> > >> +#if defined(CONFIG_RCU_LAZY) > >> +unsigned long rcu_lazy_get_jiffies_till_flush(void); > >> +void rcu_lazy_set_jiffies_till_flush(unsigned long j); > >> +#else > >> +static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; } > >> +static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { } > >> +#endif > >> + > >> #if defined(CONFIG_TREE_RCU) > >> void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags, > >> unsigned long *gp_seq); > >> diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c > >> index a33a8d4942c3..810479cf17ba 100644 > >> --- a/kernel/rcu/tiny.c > >> +++ b/kernel/rcu/tiny.c > >> @@ -44,7 +44,7 @@ static struct rcu_ctrlblk rcu_ctrlblk = { > >> > >> void rcu_barrier(void) > >> { > >> - wait_rcu_gp(call_rcu); > >> + wait_rcu_gp(call_rcu_flush); > >> } > >> EXPORT_SYMBOL(rcu_barrier); > >> > >> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > >> index 5ec97e3f7468..736d0d724207 100644 > >> --- a/kernel/rcu/tree.c > >> +++ b/kernel/rcu/tree.c > >> @@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp) > >> raw_spin_unlock_rcu_node(rnp); > >> } > >> > >> -/** > >> - * call_rcu() - Queue an RCU callback for invocation after a grace period. > >> - * @head: structure to be used for queueing the RCU updates. > >> - * @func: actual callback function to be invoked after the grace period > >> - * > >> - * The callback function will be invoked some time after a full grace > >> - * period elapses, in other words after all pre-existing RCU read-side > >> - * critical sections have completed. However, the callback function > >> - * might well execute concurrently with RCU read-side critical sections > >> - * that started after call_rcu() was invoked. > >> - * > >> - * RCU read-side critical sections are delimited by rcu_read_lock() > >> - * and rcu_read_unlock(), and may be nested. In addition, but only in > >> - * v5.0 and later, regions of code across which interrupts, preemption, > >> - * or softirqs have been disabled also serve as RCU read-side critical > >> - * sections. This includes hardware interrupt handlers, softirq handlers, > >> - * and NMI handlers. > >> - * > >> - * Note that all CPUs must agree that the grace period extended beyond > >> - * all pre-existing RCU read-side critical section. On systems with more > >> - * than one CPU, this means that when "func()" is invoked, each CPU is > >> - * guaranteed to have executed a full memory barrier since the end of its > >> - * last RCU read-side critical section whose beginning preceded the call > >> - * to call_rcu(). It also means that each CPU executing an RCU read-side > >> - * critical section that continues beyond the start of "func()" must have > >> - * executed a memory barrier after the call_rcu() but before the beginning > >> - * of that RCU read-side critical section. Note that these guarantees > >> - * include CPUs that are offline, idle, or executing in user mode, as > >> - * well as CPUs that are executing in the kernel. 
> >> - * > >> - * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the > >> - * resulting RCU callback function "func()", then both CPU A and CPU B are > >> - * guaranteed to execute a full memory barrier during the time interval > >> - * between the call to call_rcu() and the invocation of "func()" -- even > >> - * if CPU A and CPU B are the same CPU (but again only if the system has > >> - * more than one CPU). > >> - * > >> - * Implementation of these memory-ordering guarantees is described here: > >> - * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst. > >> - */ > >> -void call_rcu(struct rcu_head *head, rcu_callback_t func) > >> +static void > >> +__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy) > >> { > >> static atomic_t doublefrees; > >> unsigned long flags; > >> @@ -2809,7 +2770,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func) > >> } > >> > >> check_cb_ovld(rdp); > >> - if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags)) > >> + if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy)) > >> return; // Enqueued onto ->nocb_bypass, so just leave. > >> // If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock. > >> rcu_segcblist_enqueue(&rdp->cblist, head); > >> @@ -2831,8 +2792,84 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func) > >> local_irq_restore(flags); > >> } > >> } > >> -EXPORT_SYMBOL_GPL(call_rcu); > >> > >> +#ifdef CONFIG_RCU_LAZY > >> +/** > >> + * call_rcu_flush() - Queue RCU callback for invocation after grace period, and > >> + * flush all lazy callbacks (including the new one) to the main ->cblist while > >> + * doing so. > >> + * > >> + * @head: structure to be used for queueing the RCU updates. > >> + * @func: actual callback function to be invoked after the grace period > >> + * > >> + * The callback function will be invoked some time after a full grace > >> + * period elapses, in other words after all pre-existing RCU read-side > >> + * critical sections have completed. > >> + * > >> + * Use this API instead of call_rcu() if you don't mind the callback being > >> + * invoked after very long periods of time on systems without memory pressure > >> + * and on systems which are lightly loaded or mostly idle. > > > > This comment is backwards, right? Shouldn't it say something like "Use > > this API instead of call_rcu() if you don't mind burning extra power..."? > > Yes sorry my mistake :-(. It’s a stale comment from the rework and I’ll update it. Very good, thank you! > >> + * > >> + * Other than the extra delay in callbacks being invoked, this function is > >> + * identical to, and reuses call_rcu()'s logic. Refer to call_rcu() for more > >> + * details about memory ordering and other functionality. > >> + */ > >> +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func) > >> +{ > >> + return __call_rcu_common(head, func, false); > >> +} > >> +EXPORT_SYMBOL_GPL(call_rcu_flush); > >> +#endif > >> + > >> +/** > >> + * call_rcu() - Queue an RCU callback for invocation after a grace period. > >> + * By default the callbacks are 'lazy' and are kept hidden from the main > >> + * ->cblist to prevent starting of grace periods too soon. > >> + * If you desire grace periods to start very soon, use call_rcu_flush(). > >> + * > >> + * @head: structure to be used for queueing the RCU updates. 
> >> + * @func: actual callback function to be invoked after the grace period > >> + * > >> + * The callback function will be invoked some time after a full grace > >> + * period elapses, in other words after all pre-existing RCU read-side > >> + * critical sections have completed. However, the callback function > >> + * might well execute concurrently with RCU read-side critical sections > >> + * that started after call_rcu() was invoked. > >> + * > >> + * RCU read-side critical sections are delimited by rcu_read_lock() > >> + * and rcu_read_unlock(), and may be nested. In addition, but only in > >> + * v5.0 and later, regions of code across which interrupts, preemption, > >> + * or softirqs have been disabled also serve as RCU read-side critical > >> + * sections. This includes hardware interrupt handlers, softirq handlers, > >> + * and NMI handlers. > >> + * > >> + * Note that all CPUs must agree that the grace period extended beyond > >> + * all pre-existing RCU read-side critical section. On systems with more > >> + * than one CPU, this means that when "func()" is invoked, each CPU is > >> + * guaranteed to have executed a full memory barrier since the end of its > >> + * last RCU read-side critical section whose beginning preceded the call > >> + * to call_rcu(). It also means that each CPU executing an RCU read-side > >> + * critical section that continues beyond the start of "func()" must have > >> + * executed a memory barrier after the call_rcu() but before the beginning > >> + * of that RCU read-side critical section. Note that these guarantees > >> + * include CPUs that are offline, idle, or executing in user mode, as > >> + * well as CPUs that are executing in the kernel. > >> + * > >> + * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the > >> + * resulting RCU callback function "func()", then both CPU A and CPU B are > >> + * guaranteed to execute a full memory barrier during the time interval > >> + * between the call to call_rcu() and the invocation of "func()" -- even > >> + * if CPU A and CPU B are the same CPU (but again only if the system has > >> + * more than one CPU). > >> + * > >> + * Implementation of these memory-ordering guarantees is described here: > >> + * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst. > >> + */ > >> +void call_rcu(struct rcu_head *head, rcu_callback_t func) > >> +{ > >> + return __call_rcu_common(head, func, true); > >> +} > >> +EXPORT_SYMBOL_GPL(call_rcu); > >> > >> /* Maximum number of jiffies to wait before draining a batch. */ > >> #define KFREE_DRAIN_JIFFIES (5 * HZ) > >> @@ -3507,7 +3544,7 @@ void synchronize_rcu(void) > >> if (rcu_gp_is_expedited()) > >> synchronize_rcu_expedited(); > >> else > >> - wait_rcu_gp(call_rcu); > >> + wait_rcu_gp(call_rcu_flush); > >> return; > >> } > >> > >> @@ -3902,7 +3939,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) > >> rdp->barrier_head.func = rcu_barrier_callback; > >> debug_rcu_head_queue(&rdp->barrier_head); > >> rcu_nocb_lock(rdp); > >> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); > >> + /* > >> + * Flush the bypass list, but also wake up the GP thread as otherwise > >> + * bypass/lazy CBs maynot be noticed, and can cause real long delays! 
> >> + */ > >> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE)); > >> if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) { > >> atomic_inc(&rcu_state.barrier_cpu_count); > >> } else { > >> @@ -4323,7 +4364,7 @@ void rcutree_migrate_callbacks(int cpu) > >> my_rdp = this_cpu_ptr(&rcu_data); > >> my_rnp = my_rdp->mynode; > >> rcu_nocb_lock(my_rdp); /* irqs already disabled. */ > >> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies)); > >> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, FLUSH_BP_NONE)); > >> raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */ > >> /* Leverage recent GPs and set GP for new callbacks. */ > >> needwake = rcu_advance_cbs(my_rnp, rdp) || > >> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h > >> index d4a97e40ea9c..361c41d642c7 100644 > >> --- a/kernel/rcu/tree.h > >> +++ b/kernel/rcu/tree.h > >> @@ -263,14 +263,16 @@ struct rcu_data { > >> unsigned long last_fqs_resched; /* Time of last rcu_resched(). */ > >> unsigned long last_sched_clock; /* Jiffies of last rcu_sched_clock_irq(). */ > >> > >> + long lazy_len; /* Length of buffered lazy callbacks. */ > > > > Do we ever actually care about the length as opposed to whether or not all > > the bypass callbacks are lazy? If not, a "some_nonlazy" boolean would be > > initialed to zero and ORed with the non-laziness of the added callback. > > Or, if there was a test anyway, simply set to 1 in the presence of a > > non-lazy callback. And as now, gets zeroed when the bypass is flushed. > > We had discussed this before, and my point was > we could use it for tracing and future extension > as well. If it’s ok with you, I prefer to keep it this > way than having to come back add the length later. > > On a minor point as well, if you really want it this > way, I am afraid of changing this now since I tested > this way for last several iterations and it’s easy to > add a regression. So I prefer to keep it this way and > then I can add more patches later on top if that’s > Ok with you. No, you are right, the debug information might be helpful. But hey, at least I am consistent! > > This might shorten a few lines of code. > > > >> int cpu; > >> }; > >> > >> /* Values for nocb_defer_wakeup field in struct rcu_data. */ > >> #define RCU_NOCB_WAKE_NOT 0 > >> #define RCU_NOCB_WAKE_BYPASS 1 > >> -#define RCU_NOCB_WAKE 2 > >> -#define RCU_NOCB_WAKE_FORCE 3 > >> +#define RCU_NOCB_WAKE_LAZY 2 > >> +#define RCU_NOCB_WAKE 3 > >> +#define RCU_NOCB_WAKE_FORCE 4 > >> > >> #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500)) > >> /* For jiffies_till_first_fqs and */ > >> @@ -439,10 +441,17 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp); > >> static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp); > >> static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq); > >> static void rcu_init_one_nocb(struct rcu_node *rnp); > >> + > >> +#define FLUSH_BP_NONE 0 > >> +/* Is the CB being enqueued after the flush, a lazy CB? */ > >> +#define FLUSH_BP_LAZY BIT(0) > >> +/* Wake up nocb-GP thread after flush? 
*/ > >> +#define FLUSH_BP_WAKE BIT(1) > >> static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > >> - unsigned long j); > >> + unsigned long j, unsigned long flush_flags); > >> static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > >> - bool *was_alldone, unsigned long flags); > >> + bool *was_alldone, unsigned long flags, > >> + bool lazy); > >> static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty, > >> unsigned long flags); > >> static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level); > >> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > >> index 18e9b4cd78ef..5cac05600798 100644 > >> --- a/kernel/rcu/tree_exp.h > >> +++ b/kernel/rcu/tree_exp.h > >> @@ -937,7 +937,7 @@ void synchronize_rcu_expedited(void) > >> > >> /* If expedited grace periods are prohibited, fall back to normal. */ > >> if (rcu_gp_is_normal()) { > >> - wait_rcu_gp(call_rcu); > >> + wait_rcu_gp(call_rcu_flush); > >> return; > >> } > >> > >> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h > >> index f77a6d7e1356..661c685aba3f 100644 > >> --- a/kernel/rcu/tree_nocb.h > >> +++ b/kernel/rcu/tree_nocb.h > >> @@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force) > >> return __wake_nocb_gp(rdp_gp, rdp, force, flags); > >> } > >> > >> +/* > >> + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that > >> + * can elapse before lazy callbacks are flushed. Lazy callbacks > >> + * could be flushed much earlier for a number of other reasons > >> + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are > >> + * left unsubmitted to RCU after those many jiffies. > >> + */ > >> +#define LAZY_FLUSH_JIFFIES (10 * HZ) > >> +static unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES; > >> + > >> +#ifdef CONFIG_RCU_LAZY > >> +// To be called only from test code. > >> +void rcu_lazy_set_jiffies_till_flush(unsigned long jif) > >> +{ > >> + jiffies_till_flush = jif; > >> +} > >> +EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush); > >> + > >> +unsigned long rcu_lazy_get_jiffies_till_flush(void) > >> +{ > >> + return jiffies_till_flush; > >> +} > >> +EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush); > >> +#endif > >> + > >> /* > >> * Arrange to wake the GP kthread for this NOCB group at some future > >> * time when it is safe to do so. > >> @@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, > >> raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags); > >> > >> /* > >> - * Bypass wakeup overrides previous deferments. In case > >> - * of callback storm, no need to wake up too early. > >> + * Bypass wakeup overrides previous deferments. In case of > >> + * callback storm, no need to wake up too early. > >> */ > >> - if (waketype == RCU_NOCB_WAKE_BYPASS) { > >> + if (waketype == RCU_NOCB_WAKE_LAZY > >> + && READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) { > > > > Please leave the "&&" on the previous line and line up the "READ_ONCE(" > > with the "waketype". That makes it easier to tell the condition from > > the following code. > > Will do! 
> > > > >> + mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush); > >> + WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype); > >> + } else if (waketype == RCU_NOCB_WAKE_BYPASS) { > >> mod_timer(&rdp_gp->nocb_timer, jiffies + 2); > >> WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype); > >> } else { > >> @@ -293,12 +322,16 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, > >> * proves to be initially empty, just return false because the no-CB GP > >> * kthread may need to be awakened in this case. > >> * > >> + * Return true if there was something to be flushed and it succeeded, otherwise > >> + * false. > >> + * > >> * Note that this function always returns true if rhp is NULL. > >> */ > >> static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > >> - unsigned long j) > >> + unsigned long j, unsigned long flush_flags) > >> { > >> struct rcu_cblist rcl; > >> + bool lazy = flush_flags & FLUSH_BP_LAZY; > >> > >> WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp)); > >> rcu_lockdep_assert_cblist_protected(rdp); > >> @@ -310,7 +343,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > >> /* Note: ->cblist.len already accounts for ->nocb_bypass contents. */ > >> if (rhp) > >> rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */ > >> - rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > >> + > >> + /* > >> + * If the new CB requested was a lazy one, queue it onto the main > >> + * ->cblist so we can take advantage of a sooner grade period. > > > > "take advantage of a grace period that will happen regardless."? > > Sure will update. > > > > >> + */ > >> + if (lazy && rhp) { > >> + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL); > >> + rcu_cblist_enqueue(&rcl, rhp); > > > > Would it makes sense to enqueue rhp onto ->nocb_bypass first, NULL out > > rhp, then let the rcu_cblist_flush_enqueue() be common code? Or did this > > function grow a later use of rhp that I missed? > > No that could be done, but it prefer to keep it this > way because rhp is a function parameter and I > prefer not to modify those since it could add a > bug in future where rhp passed by user is now > NULL for some reason, half way through the > function. I agree that changing a function parameter is bad practice. So the question becomes whether introducing a local would outweigh consolidating this code. Could you please at least give it a shot? > >> + WRITE_ONCE(rdp->lazy_len, 0); > >> + } else { > >> + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > >> + WRITE_ONCE(rdp->lazy_len, 0); > > > > This WRITE_ONCE() can be dropped out of the "if" statement, correct? > > Yes will update. Thank you! > > If so, this could be an "if" statement with two statements in its "then" > > clause, no "else" clause, and two statements following the "if" statement. > > I don’t think we can get rid of the else part but I’ll see what it looks like. In the function header, s/rhp/rhp_in/, then: struct rcu_head *rhp = rhp_in; And then: if (lazy && rhp) { rcu_cblist_enqueue(&rdp->nocb_bypass, rhp); rhp = NULL; } rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); WRITE_ONCE(rdp->lazy_len, 0); Or did I mess something up? > >> + } > >> + > >> rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl); > >> WRITE_ONCE(rdp->nocb_bypass_first, j); > >> rcu_nocb_bypass_unlock(rdp); > >> @@ -326,13 +372,33 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > >> * Note that this function always returns true if rhp is NULL. 
> >> */ > >> static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > >> - unsigned long j) > >> + unsigned long j, unsigned long flush_flags) > >> { > >> + bool ret; > >> + bool was_alldone = false; > >> + bool bypass_all_lazy = false; > >> + long bypass_ncbs; > > > > Alphabetical order by variable name, please. (Yes, I know that this is > > strange, but what can I say?) > > Sure. Thank you! > >> + > >> if (!rcu_rdp_is_offloaded(rdp)) > >> return true; > >> rcu_lockdep_assert_cblist_protected(rdp); > >> rcu_nocb_bypass_lock(rdp); > >> - return rcu_nocb_do_flush_bypass(rdp, rhp, j); > >> + > >> + if (flush_flags & FLUSH_BP_WAKE) { > >> + was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); > >> + bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); > >> + bypass_all_lazy = bypass_ncbs && (bypass_ncbs == rdp->lazy_len); > >> + } > >> + > >> + ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags); > >> + > >> + // Wake up the nocb GP thread if needed. GP thread could be sleeping > >> + // while waiting for lazy timer to expire (otherwise rcu_barrier may > >> + // end up waiting for the duration of the lazy timer). > >> + if (flush_flags & FLUSH_BP_WAKE && was_alldone && bypass_all_lazy) > >> + wake_nocb_gp(rdp, false); > >> + > >> + return ret; > >> } > >> > >> /* > >> @@ -345,7 +411,7 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j) > >> if (!rcu_rdp_is_offloaded(rdp) || > >> !rcu_nocb_bypass_trylock(rdp)) > >> return; > >> - WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j)); > >> + WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j, FLUSH_BP_NONE)); > >> } > >> > >> /* > >> @@ -367,12 +433,14 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j) > >> * there is only one CPU in operation. > >> */ > >> static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > >> - bool *was_alldone, unsigned long flags) > >> + bool *was_alldone, unsigned long flags, > >> + bool lazy) > >> { > >> unsigned long c; > >> unsigned long cur_gp_seq; > >> unsigned long j = jiffies; > >> long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); > >> + bool bypass_is_lazy = (ncbs == READ_ONCE(rdp->lazy_len)); > >> > >> lockdep_assert_irqs_disabled(); > >> > >> @@ -417,25 +485,30 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > >> // If there hasn't yet been all that many ->cblist enqueues > >> // this jiffy, tell the caller to enqueue onto ->cblist. But flush > >> // ->nocb_bypass first. > >> - if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) { > >> + // Lazy CBs throttle this back and do immediate bypass queuing. > >> + if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) { > >> rcu_nocb_lock(rdp); > >> *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); > >> if (*was_alldone) > >> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, > >> TPS("FirstQ")); > >> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j)); > >> + > >> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, FLUSH_BP_NONE)); > >> WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass)); > >> return false; // Caller must enqueue the callback. > >> } > >> > >> // If ->nocb_bypass has been used too long or is too full, > >> // flush ->nocb_bypass to ->cblist. 
> >> - if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) || > >> + if ((ncbs && !bypass_is_lazy && j != READ_ONCE(rdp->nocb_bypass_first)) || > >> + (ncbs && bypass_is_lazy && > >> + (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush))) || > >> ncbs >= qhimark) { > >> rcu_nocb_lock(rdp); > >> *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); > >> > >> - if (!rcu_nocb_flush_bypass(rdp, rhp, j)) { > >> + if (!rcu_nocb_flush_bypass(rdp, rhp, j, > >> + lazy ? FLUSH_BP_LAZY : FLUSH_BP_NONE)) { > >> if (*was_alldone) > >> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, > >> TPS("FirstQ")); > >> @@ -460,16 +533,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > >> // We need to use the bypass. > >> rcu_nocb_wait_contended(rdp); > >> rcu_nocb_bypass_lock(rdp); > >> + > >> ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); > >> rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */ > >> rcu_cblist_enqueue(&rdp->nocb_bypass, rhp); > >> + > >> + if (IS_ENABLED(CONFIG_RCU_LAZY) && lazy) > > > > Won't !IS_ENABLED(CONFIG_RCU_LAZY) mean that lazy cannot be set? > > Why do we need to check both? Or are you going for dead code? If so, > > shouldn't there be IS_ENABLED(CONFIG_RCU_LAZY) checks above as well? > > > > Me, I am not convinced that the dead code would buy you much. In fact, > > the compiler might well be propagating the constants on its own. > > > > Ah! The reason the compiler cannot figure this out is because you put > > the switch into rcu.h. If you instead always export the call_rcu_flush() > > definition, and check IS_ENABLED(CONFIG_RCU_LAZY) at the beginning of > > call_rcu(), the compiler should have the information that it needs to > > do this for you. > > Ah ok, I will try to do it this way. Very good, thank you, for this and for the rest below. Thanx, Paul > >> + WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1); > >> + > >> if (!ncbs) { > >> WRITE_ONCE(rdp->nocb_bypass_first, j); > >> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ")); > >> } > >> + > >> rcu_nocb_bypass_unlock(rdp); > >> smp_mb(); /* Order enqueue before wake. */ > >> - if (ncbs) { > >> + > >> + // A wake up of the grace period kthread or timer adjustment needs to > >> + // be done only if: > >> + // 1. Bypass list was fully empty before (this is the first bypass list entry). > >> + // Or, both the below conditions are met: > >> + // 1. Bypass list had only lazy CBs before. > >> + // 2. The new CB is non-lazy. > >> + if (ncbs && (!bypass_is_lazy || lazy)) { > >> local_irq_restore(flags); > >> } else { > >> // No-CBs GP kthread might be indefinitely asleep, if so, wake. > >> @@ -499,7 +585,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, > >> { > >> unsigned long cur_gp_seq; > >> unsigned long j; > >> - long len; > >> + long len, lazy_len, bypass_len; > >> struct task_struct *t; > > > > Again, alphabetical please, strange though that might seem. > > Yes sure > > > > >> // If we are being polled or there is no kthread, just leave. > >> @@ -512,9 +598,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, > >> } > >> // Need to actually to a wakeup. 
> >> len = rcu_segcblist_n_cbs(&rdp->cblist); > >> + bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass); > >> + lazy_len = READ_ONCE(rdp->lazy_len); > >> if (was_alldone) { > >> rdp->qlen_last_fqs_check = len; > >> - if (!irqs_disabled_flags(flags)) { > >> + // Only lazy CBs in bypass list > >> + if (lazy_len && bypass_len == lazy_len) { > >> + rcu_nocb_unlock_irqrestore(rdp, flags); > >> + wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY, > >> + TPS("WakeLazy")); > >> + } else if (!irqs_disabled_flags(flags)) { > >> /* ... if queue was empty ... */ > >> rcu_nocb_unlock_irqrestore(rdp, flags); > >> wake_nocb_gp(rdp, false); > >> @@ -604,8 +697,8 @@ static void nocb_gp_sleep(struct rcu_data *my_rdp, int cpu) > >> */ > >> static void nocb_gp_wait(struct rcu_data *my_rdp) > >> { > >> - bool bypass = false; > >> - long bypass_ncbs; > >> + bool bypass = false, lazy = false; > >> + long bypass_ncbs, lazy_ncbs; > > > > And ditto. > > Ok > > > > >> int __maybe_unused cpu = my_rdp->cpu; > >> unsigned long cur_gp_seq; > >> unsigned long flags; > >> @@ -640,24 +733,41 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) > >> * won't be ignored for long. > >> */ > >> list_for_each_entry(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp) { > >> + bool flush_bypass = false; > >> + > >> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check")); > >> rcu_nocb_lock_irqsave(rdp, flags); > >> lockdep_assert_held(&rdp->nocb_lock); > >> bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); > >> - if (bypass_ncbs && > >> + lazy_ncbs = READ_ONCE(rdp->lazy_len); > >> + > >> + if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) && > >> + (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) || > >> + bypass_ncbs > 2 * qhimark)) { > >> + flush_bypass = true; > >> + } else if (bypass_ncbs && (lazy_ncbs != bypass_ncbs) && > >> (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) || > >> bypass_ncbs > 2 * qhimark)) { > >> - // Bypass full or old, so flush it. > >> - (void)rcu_nocb_try_flush_bypass(rdp, j); > >> - bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); > >> + flush_bypass = true; > >> } else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) { > >> rcu_nocb_unlock_irqrestore(rdp, flags); > >> continue; /* No callbacks here, try next. */ > >> } > >> + > >> + if (flush_bypass) { > >> + // Bypass full or old, so flush it. > >> + (void)rcu_nocb_try_flush_bypass(rdp, j); > >> + bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); > >> + lazy_ncbs = READ_ONCE(rdp->lazy_len); > >> + } > >> + > >> if (bypass_ncbs) { > >> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, > >> - TPS("Bypass")); > >> - bypass = true; > >> + bypass_ncbs == lazy_ncbs ? TPS("Lazy") : TPS("Bypass")); > >> + if (bypass_ncbs == lazy_ncbs) > >> + lazy = true; > >> + else > >> + bypass = true; > >> } > >> rnp = rdp->mynode; > >> > >> @@ -705,12 +815,21 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) > >> my_rdp->nocb_gp_gp = needwait_gp; > >> my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0; > >> > >> - if (bypass && !rcu_nocb_poll) { > >> - // At least one child with non-empty ->nocb_bypass, so set > >> - // timer in order to avoid stranding its callbacks. > >> - wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS, > >> - TPS("WakeBypassIsDeferred")); > >> + // At least one child with non-empty ->nocb_bypass, so set > >> + // timer in order to avoid stranding its callbacks. > >> + if (!rcu_nocb_poll) { > >> + // If bypass list only has lazy CBs. Add a deferred > >> + // lazy wake up. > > > > One sentence rather than two. 
> > Ok > > > > >> + if (lazy && !bypass) { > >> + wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY, > >> + TPS("WakeLazyIsDeferred")); > >> + // Otherwise add a deferred bypass wake up. > >> + } else if (bypass) { > >> + wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS, > >> + TPS("WakeBypassIsDeferred")); > >> + } > >> } > >> + > >> if (rcu_nocb_poll) { > >> /* Polling, so trace if first poll in the series. */ > >> if (gotcbs) > >> @@ -1036,7 +1155,7 @@ static long rcu_nocb_rdp_deoffload(void *arg) > >> * return false, which means that future calls to rcu_nocb_try_bypass() > >> * will refuse to put anything into the bypass. > >> */ > >> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); > >> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_NONE)); > >> /* > >> * Start with invoking rcu_core() early. This way if the current thread > >> * happens to preempt an ongoing call to rcu_core() in the middle, > >> @@ -1278,6 +1397,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) > >> raw_spin_lock_init(&rdp->nocb_gp_lock); > >> timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0); > >> rcu_cblist_init(&rdp->nocb_bypass); > >> + WRITE_ONCE(rdp->lazy_len, 0); > >> mutex_init(&rdp->nocb_gp_kthread_mutex); > >> } > >> > >> @@ -1559,13 +1679,13 @@ static void rcu_init_one_nocb(struct rcu_node *rnp) > >> } > >> > >> static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > >> - unsigned long j) > >> + unsigned long j, unsigned long flush_flags) > >> { > >> return true; > >> } > >> > >> static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > >> - bool *was_alldone, unsigned long flags) > >> + bool *was_alldone, unsigned long flags, bool lazy) > >> { > >> return false; > >> } > >> -- > >> 2.37.3.998.g577e59143f-goog > >>
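Folding the rhp_in suggestion and the now-common WRITE_ONCE() together, the flush helper from the quoted diff could end up looking roughly like the sketch below (illustration only, reusing the names from the patch; not necessarily what gets merged):

	static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp,
					     struct rcu_head *rhp_in,
					     unsigned long j, unsigned long flush_flags)
	{
		struct rcu_cblist rcl;
		struct rcu_head *rhp = rhp_in;	/* Local copy; the parameter stays untouched. */
		bool lazy = flush_flags & FLUSH_BP_LAZY;

		WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp));
		rcu_lockdep_assert_cblist_protected(rdp);
		rcu_nocb_bypass_lock(rdp);

		/* Note: ->cblist.len already accounts for ->nocb_bypass contents. */
		if (rhp)
			rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */

		/*
		 * A new lazy CB is appended to the bypass list first so that
		 * the single flush below moves it to the main ->cblist along
		 * with the rest, letting it use the grace period that will
		 * happen regardless.  A non-lazy rhp stays non-NULL and is
		 * instead left as the first entry of the refreshed bypass
		 * list by rcu_cblist_flush_enqueue().
		 */
		if (lazy && rhp) {
			rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
			rhp = NULL;
		}
		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
		WRITE_ONCE(rdp->lazy_len, 0);

		rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
		WRITE_ONCE(rdp->nocb_bypass_first, j);
		rcu_nocb_bypass_unlock(rdp);
		return true;
	}

Ordering should be preserved relative to the two-branch version in the patch: in the lazy case the new callback still ends up at the tail of the flushed list before it is inserted into the main ->cblist.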
On Thu, Sep 22, 2022 at 10:01:01PM +0000, Joel Fernandes (Google) wrote: > @@ -3902,7 +3939,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) > rdp->barrier_head.func = rcu_barrier_callback; > debug_rcu_head_queue(&rdp->barrier_head); > rcu_nocb_lock(rdp); > - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); > + /* > + * Flush the bypass list, but also wake up the GP thread as otherwise > + * bypass/lazy CBs maynot be noticed, and can cause real long delays! > + */ > + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE)); This fixes an issue that goes beyond lazy implementation. It should be done in a separate patch, handling rcu_segcblist_entrain() as well, with "Fixes: " tag. And then FLUSH_BP_WAKE is probably not needed anymore. > if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) { > atomic_inc(&rcu_state.barrier_cpu_count); > } else { > @@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, > raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags); > > /* > - * Bypass wakeup overrides previous deferments. In case > - * of callback storm, no need to wake up too early. > + * Bypass wakeup overrides previous deferments. In case of > + * callback storm, no need to wake up too early. > */ > - if (waketype == RCU_NOCB_WAKE_BYPASS) { > + if (waketype == RCU_NOCB_WAKE_LAZY > + && READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) { This can be a plain READ since ->nocb_defer_wakeup is only written under ->nocb_gp_lock. > + mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush); > + WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype); > + } else if (waketype == RCU_NOCB_WAKE_BYPASS) { > mod_timer(&rdp_gp->nocb_timer, jiffies + 2); > WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype); > } else { > @@ -512,9 +598,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, > } > // Need to actually to a wakeup. > len = rcu_segcblist_n_cbs(&rdp->cblist); > + bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass); > + lazy_len = READ_ONCE(rdp->lazy_len); > if (was_alldone) { > rdp->qlen_last_fqs_check = len; > - if (!irqs_disabled_flags(flags)) { > + // Only lazy CBs in bypass list > + if (lazy_len && bypass_len == lazy_len) { > + rcu_nocb_unlock_irqrestore(rdp, flags); > + wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY, > + TPS("WakeLazy")); I'm trying to think of a case where rcu_nocb_try_bypass() returns false (queue to regular list) but then call_rcu() -> __call_rcu_nocb_wake() ends up seeing a lazy bypass queue even though we are queueing a non-lazy callback (should have flushed in this case). Looks like it shouldn't happen, even with concurrent (de-offloading) but just in case, can we add: WARN_ON_ONCE(lazy_len != len) > + } else if (!irqs_disabled_flags(flags)) { > /* ... if queue was empty ... */ > rcu_nocb_unlock_irqrestore(rdp, flags); > wake_nocb_gp(rdp, false); Thanks.
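Combining the plain-read point above with Paul's earlier request to keep the "&&" at the end of the line, the RCU_NOCB_WAKE_LAZY branch of wake_nocb_gp_defer() might end up shaped as below; this is only a sketch of the suggested form, not the posted code:

	/*
	 * Bypass wakeup overrides previous deferments. In case of
	 * callback storm, no need to wake up too early.
	 */
	if (waketype == RCU_NOCB_WAKE_LAZY &&
	    rdp->nocb_defer_wakeup == RCU_NOCB_WAKE_NOT) {
		/* Plain read per the review above: updates to
		 * ->nocb_defer_wakeup are serialized by ->nocb_gp_lock,
		 * which is already held at this point. */
		mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush);
		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
	} else if (waketype == RCU_NOCB_WAKE_BYPASS) {
		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
	}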
> On Sep 24, 2022, at 5:11 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > On Sat, Sep 24, 2022 at 12:20:00PM -0400, Joel Fernandes wrote: >> Hi Paul, >> Let’s see whether my iPhone can handle replies ;-) > > Now let's see if it can handle replies to replies! Are you using a > keyboard, or small-screen finger gestures? Haha here it comes. It’s the usual onscreen keyboard which has finger gesture support and is quite fast to type with ;-). It’s also connected to my apple watch so I can reply from there but I’ll take it one step a time ;-) >> More below: >> >>>> On Sep 23, 2022, at 5:44 PM, Paul E. McKenney <paulmck@kernel.org> wrote: >>> >>> On Thu, Sep 22, 2022 at 10:01:01PM +0000, Joel Fernandes (Google) wrote: >>>> Implement timer-based RCU lazy callback batching. The batch is flushed >>>> whenever a certain amount of time has passed, or the batch on a >>>> particular CPU grows too big. Also memory pressure will flush it in a >>>> future patch. >>>> >>>> To handle several corner cases automagically (such as rcu_barrier() and >>>> hotplug), we re-use bypass lists to handle lazy CBs. The bypass list >>>> length has the lazy CB length included in it. A separate lazy CB length >>>> counter is also introduced to keep track of the number of lazy CBs. >>>> >>>> v5->v6: >>>> >>>> [ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other >>>> deferral levels wake much earlier so for those it is not needed. ] >>>> >>>> [ Frederic Weisbec: Use flush flags to keep bypass API code clean. ] >>>> >>>> [ Frederic Weisbec: Make rcu_barrier() wake up only if main list empty. ] >>>> >>>> [ Frederic Weisbec: Remove extra 'else if' branch in rcu_nocb_try_bypass(). ] >>>> >>>> [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ] >>>> >>>> [ Paul/Thomas/Joel: Make call_rcu() default lazy so users don't mess up. ] >>>> >>>> Suggested-by: Paul McKenney <paulmck@kernel.org> >>>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> >>> >>> I am going to put these on a local branch for testing, but a few comments >>> and questions interspersed below. >>> >>> Thanx, Paul >> >> Ok I replied below, thank you! >> >>>> include/linux/rcupdate.h | 7 ++ >>>> kernel/rcu/Kconfig | 8 ++ >>>> kernel/rcu/rcu.h | 8 ++ >>>> kernel/rcu/tiny.c | 2 +- >>>> kernel/rcu/tree.c | 133 ++++++++++++++++++---------- >>>> kernel/rcu/tree.h | 17 +++- >>>> kernel/rcu/tree_exp.h | 2 +- >>>> kernel/rcu/tree_nocb.h | 184 ++++++++++++++++++++++++++++++++------- >>>> 8 files changed, 277 insertions(+), 84 deletions(-) >>>> >>>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h >>>> index 08605ce7379d..40ae36904825 100644 >>>> --- a/include/linux/rcupdate.h >>>> +++ b/include/linux/rcupdate.h >>>> @@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void) >>>> >>>> #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ >>>> >>>> +#ifdef CONFIG_RCU_LAZY >>>> +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func); >>>> +#else >>>> +static inline void call_rcu_flush(struct rcu_head *head, >>>> + rcu_callback_t func) { call_rcu(head, func); } >>>> +#endif >>>> + >>>> /* Internal to kernel */ >>>> void rcu_init(void); >>>> extern int rcu_scheduler_active; >>>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig >>>> index f53ad63b2bc6..edd632e68497 100644 >>>> --- a/kernel/rcu/Kconfig >>>> +++ b/kernel/rcu/Kconfig >>>> @@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB >>>> Say N here if you hate read-side memory barriers. >>>> Take the default if you are unsure. 
>>>> >>>> +config RCU_LAZY >>>> + bool "RCU callback lazy invocation functionality" >>>> + depends on RCU_NOCB_CPU >>>> + default n >>>> + help >>>> + To save power, batch RCU callbacks and flush after delay, memory >>>> + pressure or callback list growing too big. >>>> + >>>> endmenu # "RCU Subsystem" >>>> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h >>>> index be5979da07f5..65704cbc9df7 100644 >>>> --- a/kernel/rcu/rcu.h >>>> +++ b/kernel/rcu/rcu.h >>>> @@ -474,6 +474,14 @@ enum rcutorture_type { >>>> INVALID_RCU_FLAVOR >>>> }; >>>> >>>> +#if defined(CONFIG_RCU_LAZY) >>>> +unsigned long rcu_lazy_get_jiffies_till_flush(void); >>>> +void rcu_lazy_set_jiffies_till_flush(unsigned long j); >>>> +#else >>>> +static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; } >>>> +static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { } >>>> +#endif >>>> + >>>> #if defined(CONFIG_TREE_RCU) >>>> void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags, >>>> unsigned long *gp_seq); >>>> diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c >>>> index a33a8d4942c3..810479cf17ba 100644 >>>> --- a/kernel/rcu/tiny.c >>>> +++ b/kernel/rcu/tiny.c >>>> @@ -44,7 +44,7 @@ static struct rcu_ctrlblk rcu_ctrlblk = { >>>> >>>> void rcu_barrier(void) >>>> { >>>> - wait_rcu_gp(call_rcu); >>>> + wait_rcu_gp(call_rcu_flush); >>>> } >>>> EXPORT_SYMBOL(rcu_barrier); >>>> >>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c >>>> index 5ec97e3f7468..736d0d724207 100644 >>>> --- a/kernel/rcu/tree.c >>>> +++ b/kernel/rcu/tree.c >>>> @@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp) >>>> raw_spin_unlock_rcu_node(rnp); >>>> } >>>> >>>> -/** >>>> - * call_rcu() - Queue an RCU callback for invocation after a grace period. >>>> - * @head: structure to be used for queueing the RCU updates. >>>> - * @func: actual callback function to be invoked after the grace period >>>> - * >>>> - * The callback function will be invoked some time after a full grace >>>> - * period elapses, in other words after all pre-existing RCU read-side >>>> - * critical sections have completed. However, the callback function >>>> - * might well execute concurrently with RCU read-side critical sections >>>> - * that started after call_rcu() was invoked. >>>> - * >>>> - * RCU read-side critical sections are delimited by rcu_read_lock() >>>> - * and rcu_read_unlock(), and may be nested. In addition, but only in >>>> - * v5.0 and later, regions of code across which interrupts, preemption, >>>> - * or softirqs have been disabled also serve as RCU read-side critical >>>> - * sections. This includes hardware interrupt handlers, softirq handlers, >>>> - * and NMI handlers. >>>> - * >>>> - * Note that all CPUs must agree that the grace period extended beyond >>>> - * all pre-existing RCU read-side critical section. On systems with more >>>> - * than one CPU, this means that when "func()" is invoked, each CPU is >>>> - * guaranteed to have executed a full memory barrier since the end of its >>>> - * last RCU read-side critical section whose beginning preceded the call >>>> - * to call_rcu(). It also means that each CPU executing an RCU read-side >>>> - * critical section that continues beyond the start of "func()" must have >>>> - * executed a memory barrier after the call_rcu() but before the beginning >>>> - * of that RCU read-side critical section. 
Note that these guarantees >>>> - * include CPUs that are offline, idle, or executing in user mode, as >>>> - * well as CPUs that are executing in the kernel. >>>> - * >>>> - * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the >>>> - * resulting RCU callback function "func()", then both CPU A and CPU B are >>>> - * guaranteed to execute a full memory barrier during the time interval >>>> - * between the call to call_rcu() and the invocation of "func()" -- even >>>> - * if CPU A and CPU B are the same CPU (but again only if the system has >>>> - * more than one CPU). >>>> - * >>>> - * Implementation of these memory-ordering guarantees is described here: >>>> - * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst. >>>> - */ >>>> -void call_rcu(struct rcu_head *head, rcu_callback_t func) >>>> +static void >>>> +__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy) >>>> { >>>> static atomic_t doublefrees; >>>> unsigned long flags; >>>> @@ -2809,7 +2770,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func) >>>> } >>>> >>>> check_cb_ovld(rdp); >>>> - if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags)) >>>> + if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy)) >>>> return; // Enqueued onto ->nocb_bypass, so just leave. >>>> // If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock. >>>> rcu_segcblist_enqueue(&rdp->cblist, head); >>>> @@ -2831,8 +2792,84 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func) >>>> local_irq_restore(flags); >>>> } >>>> } >>>> -EXPORT_SYMBOL_GPL(call_rcu); >>>> >>>> +#ifdef CONFIG_RCU_LAZY >>>> +/** >>>> + * call_rcu_flush() - Queue RCU callback for invocation after grace period, and >>>> + * flush all lazy callbacks (including the new one) to the main ->cblist while >>>> + * doing so. >>>> + * >>>> + * @head: structure to be used for queueing the RCU updates. >>>> + * @func: actual callback function to be invoked after the grace period >>>> + * >>>> + * The callback function will be invoked some time after a full grace >>>> + * period elapses, in other words after all pre-existing RCU read-side >>>> + * critical sections have completed. >>>> + * >>>> + * Use this API instead of call_rcu() if you don't mind the callback being >>>> + * invoked after very long periods of time on systems without memory pressure >>>> + * and on systems which are lightly loaded or mostly idle. >>> >>> This comment is backwards, right? Shouldn't it say something like "Use >>> this API instead of call_rcu() if you don't mind burning extra power..."? >> >> Yes sorry my mistake :-(. It’s a stale comment from the rework and I’ll update it. > > Very good, thank you! Anything for you! > >>>> + * >>>> + * Other than the extra delay in callbacks being invoked, this function is >>>> + * identical to, and reuses call_rcu()'s logic. Refer to call_rcu() for more >>>> + * details about memory ordering and other functionality. >>>> + */ >>>> +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func) >>>> +{ >>>> + return __call_rcu_common(head, func, false); >>>> +} >>>> +EXPORT_SYMBOL_GPL(call_rcu_flush); >>>> +#endif >>>> + >>>> +/** >>>> + * call_rcu() - Queue an RCU callback for invocation after a grace period. >>>> + * By default the callbacks are 'lazy' and are kept hidden from the main >>>> + * ->cblist to prevent starting of grace periods too soon. >>>> + * If you desire grace periods to start very soon, use call_rcu_flush(). 
>>>> + * >>>> + * @head: structure to be used for queueing the RCU updates. >>>> + * @func: actual callback function to be invoked after the grace period >>>> + * >>>> + * The callback function will be invoked some time after a full grace >>>> + * period elapses, in other words after all pre-existing RCU read-side >>>> + * critical sections have completed. However, the callback function >>>> + * might well execute concurrently with RCU read-side critical sections >>>> + * that started after call_rcu() was invoked. >>>> + * >>>> + * RCU read-side critical sections are delimited by rcu_read_lock() >>>> + * and rcu_read_unlock(), and may be nested. In addition, but only in >>>> + * v5.0 and later, regions of code across which interrupts, preemption, >>>> + * or softirqs have been disabled also serve as RCU read-side critical >>>> + * sections. This includes hardware interrupt handlers, softirq handlers, >>>> + * and NMI handlers. >>>> + * >>>> + * Note that all CPUs must agree that the grace period extended beyond >>>> + * all pre-existing RCU read-side critical section. On systems with more >>>> + * than one CPU, this means that when "func()" is invoked, each CPU is >>>> + * guaranteed to have executed a full memory barrier since the end of its >>>> + * last RCU read-side critical section whose beginning preceded the call >>>> + * to call_rcu(). It also means that each CPU executing an RCU read-side >>>> + * critical section that continues beyond the start of "func()" must have >>>> + * executed a memory barrier after the call_rcu() but before the beginning >>>> + * of that RCU read-side critical section. Note that these guarantees >>>> + * include CPUs that are offline, idle, or executing in user mode, as >>>> + * well as CPUs that are executing in the kernel. >>>> + * >>>> + * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the >>>> + * resulting RCU callback function "func()", then both CPU A and CPU B are >>>> + * guaranteed to execute a full memory barrier during the time interval >>>> + * between the call to call_rcu() and the invocation of "func()" -- even >>>> + * if CPU A and CPU B are the same CPU (but again only if the system has >>>> + * more than one CPU). >>>> + * >>>> + * Implementation of these memory-ordering guarantees is described here: >>>> + * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst. >>>> + */ >>>> +void call_rcu(struct rcu_head *head, rcu_callback_t func) >>>> +{ >>>> + return __call_rcu_common(head, func, true); >>>> +} >>>> +EXPORT_SYMBOL_GPL(call_rcu); >>>> >>>> /* Maximum number of jiffies to wait before draining a batch. */ >>>> #define KFREE_DRAIN_JIFFIES (5 * HZ) >>>> @@ -3507,7 +3544,7 @@ void synchronize_rcu(void) >>>> if (rcu_gp_is_expedited()) >>>> synchronize_rcu_expedited(); >>>> else >>>> - wait_rcu_gp(call_rcu); >>>> + wait_rcu_gp(call_rcu_flush); >>>> return; >>>> } >>>> >>>> @@ -3902,7 +3939,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) >>>> rdp->barrier_head.func = rcu_barrier_callback; >>>> debug_rcu_head_queue(&rdp->barrier_head); >>>> rcu_nocb_lock(rdp); >>>> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); >>>> + /* >>>> + * Flush the bypass list, but also wake up the GP thread as otherwise >>>> + * bypass/lazy CBs maynot be noticed, and can cause real long delays! 
>>>> + */ >>>> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE)); >>>> if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) { >>>> atomic_inc(&rcu_state.barrier_cpu_count); >>>> } else { >>>> @@ -4323,7 +4364,7 @@ void rcutree_migrate_callbacks(int cpu) >>>> my_rdp = this_cpu_ptr(&rcu_data); >>>> my_rnp = my_rdp->mynode; >>>> rcu_nocb_lock(my_rdp); /* irqs already disabled. */ >>>> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies)); >>>> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, FLUSH_BP_NONE)); >>>> raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */ >>>> /* Leverage recent GPs and set GP for new callbacks. */ >>>> needwake = rcu_advance_cbs(my_rnp, rdp) || >>>> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h >>>> index d4a97e40ea9c..361c41d642c7 100644 >>>> --- a/kernel/rcu/tree.h >>>> +++ b/kernel/rcu/tree.h >>>> @@ -263,14 +263,16 @@ struct rcu_data { >>>> unsigned long last_fqs_resched; /* Time of last rcu_resched(). */ >>>> unsigned long last_sched_clock; /* Jiffies of last rcu_sched_clock_irq(). */ >>>> >>>> + long lazy_len; /* Length of buffered lazy callbacks. */ >>> >>> Do we ever actually care about the length as opposed to whether or not all >>> the bypass callbacks are lazy? If not, a "some_nonlazy" boolean would be >>> initialed to zero and ORed with the non-laziness of the added callback. >>> Or, if there was a test anyway, simply set to 1 in the presence of a >>> non-lazy callback. And as now, gets zeroed when the bypass is flushed. >> >> We had discussed this before, and my point was >> we could use it for tracing and future extension >> as well. If it’s ok with you, I prefer to keep it this >> way than having to come back add the length later. >> >> On a minor point as well, if you really want it this >> way, I am afraid of changing this now since I tested >> this way for last several iterations and it’s easy to >> add a regression. So I prefer to keep it this way and >> then I can add more patches later on top if that’s >> Ok with you. > > No, you are right, the debug information might be helpful. > > But hey, at least I am consistent! Haha ;-) > >>> This might shorten a few lines of code. >>> >>>> int cpu; >>>> }; >>>> >>>> /* Values for nocb_defer_wakeup field in struct rcu_data. */ >>>> #define RCU_NOCB_WAKE_NOT 0 >>>> #define RCU_NOCB_WAKE_BYPASS 1 >>>> -#define RCU_NOCB_WAKE 2 >>>> -#define RCU_NOCB_WAKE_FORCE 3 >>>> +#define RCU_NOCB_WAKE_LAZY 2 >>>> +#define RCU_NOCB_WAKE 3 >>>> +#define RCU_NOCB_WAKE_FORCE 4 >>>> >>>> #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500)) >>>> /* For jiffies_till_first_fqs and */ >>>> @@ -439,10 +441,17 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp); >>>> static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp); >>>> static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq); >>>> static void rcu_init_one_nocb(struct rcu_node *rnp); >>>> + >>>> +#define FLUSH_BP_NONE 0 >>>> +/* Is the CB being enqueued after the flush, a lazy CB? */ >>>> +#define FLUSH_BP_LAZY BIT(0) >>>> +/* Wake up nocb-GP thread after flush? 
*/ >>>> +#define FLUSH_BP_WAKE BIT(1) >>>> static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >>>> - unsigned long j); >>>> + unsigned long j, unsigned long flush_flags); >>>> static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >>>> - bool *was_alldone, unsigned long flags); >>>> + bool *was_alldone, unsigned long flags, >>>> + bool lazy); >>>> static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty, >>>> unsigned long flags); >>>> static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level); >>>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h >>>> index 18e9b4cd78ef..5cac05600798 100644 >>>> --- a/kernel/rcu/tree_exp.h >>>> +++ b/kernel/rcu/tree_exp.h >>>> @@ -937,7 +937,7 @@ void synchronize_rcu_expedited(void) >>>> >>>> /* If expedited grace periods are prohibited, fall back to normal. */ >>>> if (rcu_gp_is_normal()) { >>>> - wait_rcu_gp(call_rcu); >>>> + wait_rcu_gp(call_rcu_flush); >>>> return; >>>> } >>>> >>>> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h >>>> index f77a6d7e1356..661c685aba3f 100644 >>>> --- a/kernel/rcu/tree_nocb.h >>>> +++ b/kernel/rcu/tree_nocb.h >>>> @@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force) >>>> return __wake_nocb_gp(rdp_gp, rdp, force, flags); >>>> } >>>> >>>> +/* >>>> + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that >>>> + * can elapse before lazy callbacks are flushed. Lazy callbacks >>>> + * could be flushed much earlier for a number of other reasons >>>> + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are >>>> + * left unsubmitted to RCU after those many jiffies. >>>> + */ >>>> +#define LAZY_FLUSH_JIFFIES (10 * HZ) >>>> +static unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES; >>>> + >>>> +#ifdef CONFIG_RCU_LAZY >>>> +// To be called only from test code. >>>> +void rcu_lazy_set_jiffies_till_flush(unsigned long jif) >>>> +{ >>>> + jiffies_till_flush = jif; >>>> +} >>>> +EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush); >>>> + >>>> +unsigned long rcu_lazy_get_jiffies_till_flush(void) >>>> +{ >>>> + return jiffies_till_flush; >>>> +} >>>> +EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush); >>>> +#endif >>>> + >>>> /* >>>> * Arrange to wake the GP kthread for this NOCB group at some future >>>> * time when it is safe to do so. >>>> @@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, >>>> raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags); >>>> >>>> /* >>>> - * Bypass wakeup overrides previous deferments. In case >>>> - * of callback storm, no need to wake up too early. >>>> + * Bypass wakeup overrides previous deferments. In case of >>>> + * callback storm, no need to wake up too early. >>>> */ >>>> - if (waketype == RCU_NOCB_WAKE_BYPASS) { >>>> + if (waketype == RCU_NOCB_WAKE_LAZY >>>> + && READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) { >>> >>> Please leave the "&&" on the previous line and line up the "READ_ONCE(" >>> with the "waketype". That makes it easier to tell the condition from >>> the following code. >> >> Will do! 
>> >>> >>>> + mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush); >>>> + WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype); >>>> + } else if (waketype == RCU_NOCB_WAKE_BYPASS) { >>>> mod_timer(&rdp_gp->nocb_timer, jiffies + 2); >>>> WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype); >>>> } else { >>>> @@ -293,12 +322,16 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, >>>> * proves to be initially empty, just return false because the no-CB GP >>>> * kthread may need to be awakened in this case. >>>> * >>>> + * Return true if there was something to be flushed and it succeeded, otherwise >>>> + * false. >>>> + * >>>> * Note that this function always returns true if rhp is NULL. >>>> */ >>>> static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >>>> - unsigned long j) >>>> + unsigned long j, unsigned long flush_flags) >>>> { >>>> struct rcu_cblist rcl; >>>> + bool lazy = flush_flags & FLUSH_BP_LAZY; >>>> >>>> WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp)); >>>> rcu_lockdep_assert_cblist_protected(rdp); >>>> @@ -310,7 +343,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >>>> /* Note: ->cblist.len already accounts for ->nocb_bypass contents. */ >>>> if (rhp) >>>> rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */ >>>> - rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); >>>> + >>>> + /* >>>> + * If the new CB requested was a lazy one, queue it onto the main >>>> + * ->cblist so we can take advantage of a sooner grade period. >>> >>> "take advantage of a grace period that will happen regardless."? >> >> Sure will update. >> >>> >>>> + */ >>>> + if (lazy && rhp) { >>>> + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL); >>>> + rcu_cblist_enqueue(&rcl, rhp); >>> >>> Would it makes sense to enqueue rhp onto ->nocb_bypass first, NULL out >>> rhp, then let the rcu_cblist_flush_enqueue() be common code? Or did this >>> function grow a later use of rhp that I missed? >> >> No that could be done, but it prefer to keep it this >> way because rhp is a function parameter and I >> prefer not to modify those since it could add a >> bug in future where rhp passed by user is now >> NULL for some reason, half way through the >> function. > > I agree that changing a function parameter is bad practice. > > So the question becomes whether introducing a local would outweigh > consolidating this code. Could you please at least give it a shot? Yes for sure I will try this. Was thinking the same. > >>>> + WRITE_ONCE(rdp->lazy_len, 0); >>>> + } else { >>>> + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); >>>> + WRITE_ONCE(rdp->lazy_len, 0); >>> >>> This WRITE_ONCE() can be dropped out of the "if" statement, correct? >> >> Yes will update. > > Thank you! > >>> If so, this could be an "if" statement with two statements in its "then" >>> clause, no "else" clause, and two statements following the "if" statement. >> >> I don’t think we can get rid of the else part but I’ll see what it looks like. > > In the function header, s/rhp/rhp_in/, then: > > struct rcu_head *rhp = rhp_in; > > And then: > > if (lazy && rhp) { > rcu_cblist_enqueue(&rdp->nocb_bypass, rhp); > rhp = NULL; > } > rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > WRITE_ONCE(rdp->lazy_len, 0); > > Or did I mess something up? Yes you are spot on. It seems more shorter, I’ll do it this way! 
> >>>> + } >>>> + >>>> rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl); >>>> WRITE_ONCE(rdp->nocb_bypass_first, j); >>>> rcu_nocb_bypass_unlock(rdp); >>>> @@ -326,13 +372,33 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >>>> * Note that this function always returns true if rhp is NULL. >>>> */ >>>> static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >>>> - unsigned long j) >>>> + unsigned long j, unsigned long flush_flags) >>>> { >>>> + bool ret; >>>> + bool was_alldone = false; >>>> + bool bypass_all_lazy = false; >>>> + long bypass_ncbs; >>> >>> Alphabetical order by variable name, please. (Yes, I know that this is >>> strange, but what can I say?) >> >> Sure. > > Thank you! My pleasure! > >>>> + >>>> if (!rcu_rdp_is_offloaded(rdp)) >>>> return true; >>>> rcu_lockdep_assert_cblist_protected(rdp); >>>> rcu_nocb_bypass_lock(rdp); >>>> - return rcu_nocb_do_flush_bypass(rdp, rhp, j); >>>> + >>>> + if (flush_flags & FLUSH_BP_WAKE) { >>>> + was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); >>>> + bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); >>>> + bypass_all_lazy = bypass_ncbs && (bypass_ncbs == rdp->lazy_len); >>>> + } >>>> + >>>> + ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags); >>>> + >>>> + // Wake up the nocb GP thread if needed. GP thread could be sleeping >>>> + // while waiting for lazy timer to expire (otherwise rcu_barrier may >>>> + // end up waiting for the duration of the lazy timer). >>>> + if (flush_flags & FLUSH_BP_WAKE && was_alldone && bypass_all_lazy) >>>> + wake_nocb_gp(rdp, false); >>>> + >>>> + return ret; >>>> } >>>> >>>> /* >>>> @@ -345,7 +411,7 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j) >>>> if (!rcu_rdp_is_offloaded(rdp) || >>>> !rcu_nocb_bypass_trylock(rdp)) >>>> return; >>>> - WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j)); >>>> + WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j, FLUSH_BP_NONE)); >>>> } >>>> >>>> /* >>>> @@ -367,12 +433,14 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j) >>>> * there is only one CPU in operation. >>>> */ >>>> static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >>>> - bool *was_alldone, unsigned long flags) >>>> + bool *was_alldone, unsigned long flags, >>>> + bool lazy) >>>> { >>>> unsigned long c; >>>> unsigned long cur_gp_seq; >>>> unsigned long j = jiffies; >>>> long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); >>>> + bool bypass_is_lazy = (ncbs == READ_ONCE(rdp->lazy_len)); >>>> >>>> lockdep_assert_irqs_disabled(); >>>> >>>> @@ -417,25 +485,30 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >>>> // If there hasn't yet been all that many ->cblist enqueues >>>> // this jiffy, tell the caller to enqueue onto ->cblist. But flush >>>> // ->nocb_bypass first. >>>> - if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) { >>>> + // Lazy CBs throttle this back and do immediate bypass queuing. >>>> + if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) { >>>> rcu_nocb_lock(rdp); >>>> *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); >>>> if (*was_alldone) >>>> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, >>>> TPS("FirstQ")); >>>> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j)); >>>> + >>>> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, FLUSH_BP_NONE)); >>>> WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass)); >>>> return false; // Caller must enqueue the callback. 
>>>> } >>>> >>>> // If ->nocb_bypass has been used too long or is too full, >>>> // flush ->nocb_bypass to ->cblist. >>>> - if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) || >>>> + if ((ncbs && !bypass_is_lazy && j != READ_ONCE(rdp->nocb_bypass_first)) || >>>> + (ncbs && bypass_is_lazy && >>>> + (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush))) || >>>> ncbs >= qhimark) { >>>> rcu_nocb_lock(rdp); >>>> *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); >>>> >>>> - if (!rcu_nocb_flush_bypass(rdp, rhp, j)) { >>>> + if (!rcu_nocb_flush_bypass(rdp, rhp, j, >>>> + lazy ? FLUSH_BP_LAZY : FLUSH_BP_NONE)) { >>>> if (*was_alldone) >>>> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, >>>> TPS("FirstQ")); >>>> @@ -460,16 +533,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >>>> // We need to use the bypass. >>>> rcu_nocb_wait_contended(rdp); >>>> rcu_nocb_bypass_lock(rdp); >>>> + >>>> ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); >>>> rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */ >>>> rcu_cblist_enqueue(&rdp->nocb_bypass, rhp); >>>> + >>>> + if (IS_ENABLED(CONFIG_RCU_LAZY) && lazy) >>> >>> Won't !IS_ENABLED(CONFIG_RCU_LAZY) mean that lazy cannot be set? >>> Why do we need to check both? Or are you going for dead code? If so, >>> shouldn't there be IS_ENABLED(CONFIG_RCU_LAZY) checks above as well? >>> >>> Me, I am not convinced that the dead code would buy you much. In fact, >>> the compiler might well be propagating the constants on its own. >>> >>> Ah! The reason the compiler cannot figure this out is because you put >>> the switch into rcu.h. If you instead always export the call_rcu_flush() >>> definition, and check IS_ENABLED(CONFIG_RCU_LAZY) at the beginning of >>> call_rcu(), the compiler should have the information that it needs to >>> do this for you. >> >> Ah ok, I will try to do it this way. > > Very good, thank you, for this and for the rest below. Glad we could do this review, thanks. - Joel > > Thanx, Paul > >>>> + WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1); >>>> + >>>> if (!ncbs) { >>>> WRITE_ONCE(rdp->nocb_bypass_first, j); >>>> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ")); >>>> } >>>> + >>>> rcu_nocb_bypass_unlock(rdp); >>>> smp_mb(); /* Order enqueue before wake. */ >>>> - if (ncbs) { >>>> + >>>> + // A wake up of the grace period kthread or timer adjustment needs to >>>> + // be done only if: >>>> + // 1. Bypass list was fully empty before (this is the first bypass list entry). >>>> + // Or, both the below conditions are met: >>>> + // 1. Bypass list had only lazy CBs before. >>>> + // 2. The new CB is non-lazy. >>>> + if (ncbs && (!bypass_is_lazy || lazy)) { >>>> local_irq_restore(flags); >>>> } else { >>>> // No-CBs GP kthread might be indefinitely asleep, if so, wake. >>>> @@ -499,7 +585,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, >>>> { >>>> unsigned long cur_gp_seq; >>>> unsigned long j; >>>> - long len; >>>> + long len, lazy_len, bypass_len; >>>> struct task_struct *t; >>> >>> Again, alphabetical please, strange though that might seem. >> >> Yes sure >> >>> >>>> // If we are being polled or there is no kthread, just leave. >>>> @@ -512,9 +598,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, >>>> } >>>> // Need to actually to a wakeup. 
>>>> len = rcu_segcblist_n_cbs(&rdp->cblist); >>>> + bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass); >>>> + lazy_len = READ_ONCE(rdp->lazy_len); >>>> if (was_alldone) { >>>> rdp->qlen_last_fqs_check = len; >>>> - if (!irqs_disabled_flags(flags)) { >>>> + // Only lazy CBs in bypass list >>>> + if (lazy_len && bypass_len == lazy_len) { >>>> + rcu_nocb_unlock_irqrestore(rdp, flags); >>>> + wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY, >>>> + TPS("WakeLazy")); >>>> + } else if (!irqs_disabled_flags(flags)) { >>>> /* ... if queue was empty ... */ >>>> rcu_nocb_unlock_irqrestore(rdp, flags); >>>> wake_nocb_gp(rdp, false); >>>> @@ -604,8 +697,8 @@ static void nocb_gp_sleep(struct rcu_data *my_rdp, int cpu) >>>> */ >>>> static void nocb_gp_wait(struct rcu_data *my_rdp) >>>> { >>>> - bool bypass = false; >>>> - long bypass_ncbs; >>>> + bool bypass = false, lazy = false; >>>> + long bypass_ncbs, lazy_ncbs; >>> >>> And ditto. >> >> Ok >> >>> >>>> int __maybe_unused cpu = my_rdp->cpu; >>>> unsigned long cur_gp_seq; >>>> unsigned long flags; >>>> @@ -640,24 +733,41 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) >>>> * won't be ignored for long. >>>> */ >>>> list_for_each_entry(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp) { >>>> + bool flush_bypass = false; >>>> + >>>> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check")); >>>> rcu_nocb_lock_irqsave(rdp, flags); >>>> lockdep_assert_held(&rdp->nocb_lock); >>>> bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); >>>> - if (bypass_ncbs && >>>> + lazy_ncbs = READ_ONCE(rdp->lazy_len); >>>> + >>>> + if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) && >>>> + (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) || >>>> + bypass_ncbs > 2 * qhimark)) { >>>> + flush_bypass = true; >>>> + } else if (bypass_ncbs && (lazy_ncbs != bypass_ncbs) && >>>> (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) || >>>> bypass_ncbs > 2 * qhimark)) { >>>> - // Bypass full or old, so flush it. >>>> - (void)rcu_nocb_try_flush_bypass(rdp, j); >>>> - bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); >>>> + flush_bypass = true; >>>> } else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) { >>>> rcu_nocb_unlock_irqrestore(rdp, flags); >>>> continue; /* No callbacks here, try next. */ >>>> } >>>> + >>>> + if (flush_bypass) { >>>> + // Bypass full or old, so flush it. >>>> + (void)rcu_nocb_try_flush_bypass(rdp, j); >>>> + bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); >>>> + lazy_ncbs = READ_ONCE(rdp->lazy_len); >>>> + } >>>> + >>>> if (bypass_ncbs) { >>>> trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, >>>> - TPS("Bypass")); >>>> - bypass = true; >>>> + bypass_ncbs == lazy_ncbs ? TPS("Lazy") : TPS("Bypass")); >>>> + if (bypass_ncbs == lazy_ncbs) >>>> + lazy = true; >>>> + else >>>> + bypass = true; >>>> } >>>> rnp = rdp->mynode; >>>> >>>> @@ -705,12 +815,21 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) >>>> my_rdp->nocb_gp_gp = needwait_gp; >>>> my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0; >>>> >>>> - if (bypass && !rcu_nocb_poll) { >>>> - // At least one child with non-empty ->nocb_bypass, so set >>>> - // timer in order to avoid stranding its callbacks. >>>> - wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS, >>>> - TPS("WakeBypassIsDeferred")); >>>> + // At least one child with non-empty ->nocb_bypass, so set >>>> + // timer in order to avoid stranding its callbacks. >>>> + if (!rcu_nocb_poll) { >>>> + // If bypass list only has lazy CBs. Add a deferred >>>> + // lazy wake up. >>> >>> One sentence rather than two. 
>> >> Ok >> >>> >>>> + if (lazy && !bypass) { >>>> + wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY, >>>> + TPS("WakeLazyIsDeferred")); >>>> + // Otherwise add a deferred bypass wake up. >>>> + } else if (bypass) { >>>> + wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS, >>>> + TPS("WakeBypassIsDeferred")); >>>> + } >>>> } >>>> + >>>> if (rcu_nocb_poll) { >>>> /* Polling, so trace if first poll in the series. */ >>>> if (gotcbs) >>>> @@ -1036,7 +1155,7 @@ static long rcu_nocb_rdp_deoffload(void *arg) >>>> * return false, which means that future calls to rcu_nocb_try_bypass() >>>> * will refuse to put anything into the bypass. >>>> */ >>>> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); >>>> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_NONE)); >>>> /* >>>> * Start with invoking rcu_core() early. This way if the current thread >>>> * happens to preempt an ongoing call to rcu_core() in the middle, >>>> @@ -1278,6 +1397,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp) >>>> raw_spin_lock_init(&rdp->nocb_gp_lock); >>>> timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0); >>>> rcu_cblist_init(&rdp->nocb_bypass); >>>> + WRITE_ONCE(rdp->lazy_len, 0); >>>> mutex_init(&rdp->nocb_gp_kthread_mutex); >>>> } >>>> >>>> @@ -1559,13 +1679,13 @@ static void rcu_init_one_nocb(struct rcu_node *rnp) >>>> } >>>> >>>> static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >>>> - unsigned long j) >>>> + unsigned long j, unsigned long flush_flags) >>>> { >>>> return true; >>>> } >>>> >>>> static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, >>>> - bool *was_alldone, unsigned long flags) >>>> + bool *was_alldone, unsigned long flags, bool lazy) >>>> { >>>> return false; >>>> } >>>> -- >>>> 2.37.3.998.g577e59143f-goog >>>>
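As a purely illustrative sketch of Paul's IS_ENABLED() suggestion above (always build call_rcu_flush() and let constant propagation inside call_rcu() replace the #ifdef switch in rcu.h), it could look roughly like the following; only the function names come from the patch, the placement is an assumption:

/*
 * Sketch only: the Kconfig constant is visible to the compiler here,
 * so the lazy path becomes dead code when CONFIG_RCU_LAZY=n.
 */
void call_rcu_flush(struct rcu_head *head, rcu_callback_t func)
{
	return __call_rcu_common(head, func, false);
}
EXPORT_SYMBOL_GPL(call_rcu_flush);

void call_rcu(struct rcu_head *head, rcu_callback_t func)
{
	return __call_rcu_common(head, func, IS_ENABLED(CONFIG_RCU_LAZY));
}
EXPORT_SYMBOL_GPL(call_rcu);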
Hi Frederic, thanks for the response, replies below courtesy fruit company’s device: > On Sep 24, 2022, at 6:46 PM, Frederic Weisbecker <frederic@kernel.org> wrote: > > On Thu, Sep 22, 2022 at 10:01:01PM +0000, Joel Fernandes (Google) wrote: >> @@ -3902,7 +3939,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) >> rdp->barrier_head.func = rcu_barrier_callback; >> debug_rcu_head_queue(&rdp->barrier_head); >> rcu_nocb_lock(rdp); >> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); >> + /* >> + * Flush the bypass list, but also wake up the GP thread as otherwise >> + * bypass/lazy CBs maynot be noticed, and can cause real long delays! >> + */ >> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE)); > > This fixes an issue that goes beyond lazy implementation. It should be done > in a separate patch, handling rcu_segcblist_entrain() as well, with "Fixes: " tag. I wanted to do that, however on discussion with Paul I thought of making this optimization only for all lazy bypass CBs. That makes it directly related this patch since the laziness notion is first introduced here. On the other hand I could make this change in a later patch since we are not super bisectable anyway courtesy of the last patch (which is not really an issue if the CONFIG is kept off during someone’s bisection. > And then FLUSH_BP_WAKE is probably not needed anymore. It is needed as the API is in tree_nocb.h and we have to have that handle the details of laziness there rather than tree.c. We could add new apis to get rid of flag but it’s cleaner (and Paul seemed to be ok with it). >> if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) { >> atomic_inc(&rcu_state.barrier_cpu_count); >> } else { >> @@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, >> raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags); >> >> /* >> - * Bypass wakeup overrides previous deferments. In case >> - * of callback storm, no need to wake up too early. >> + * Bypass wakeup overrides previous deferments. In case of >> + * callback storm, no need to wake up too early. >> */ >> - if (waketype == RCU_NOCB_WAKE_BYPASS) { >> + if (waketype == RCU_NOCB_WAKE_LAZY >> + && READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) { > > This can be a plain READ since ->nocb_defer_wakeup is only written under ->nocb_gp_lock. Yes makes sense, will do. >> + mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush); >> + WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype); >> + } else if (waketype == RCU_NOCB_WAKE_BYPASS) { >> mod_timer(&rdp_gp->nocb_timer, jiffies + 2); >> WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype); >> } else { >> @@ -512,9 +598,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, >> } >> // Need to actually to a wakeup. >> len = rcu_segcblist_n_cbs(&rdp->cblist); >> + bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass); >> + lazy_len = READ_ONCE(rdp->lazy_len); >> if (was_alldone) { >> rdp->qlen_last_fqs_check = len; >> - if (!irqs_disabled_flags(flags)) { >> + // Only lazy CBs in bypass list >> + if (lazy_len && bypass_len == lazy_len) { >> + rcu_nocb_unlock_irqrestore(rdp, flags); >> + wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY, >> + TPS("WakeLazy")); > > I'm trying to think of a case where rcu_nocb_try_bypass() returns false > (queue to regular list) but then call_rcu() -> __call_rcu_nocb_wake() ends up > seeing a lazy bypass queue even though we are queueing a non-lazy callback > (should have flushed in this case). 
> > Looks like it shouldn't happen, even with concurrent (de-offloading) but just > in case, can we add: Yes I also feel this couldn’t happen because irq is off and nocb lock is held throughout the calls to the above 2 functions. Unless I missed the race you’re describing? > > WARN_ON_ONCE(lazy_len != len) But this condition can be true even in normal circumstances? len also contains DONE CBs which are ready to be invoked. Or did I miss something? Thanks, - Joel > >> + } else if (!irqs_disabled_flags(flags)) { >> /* ... if queue was empty ... */ >> rcu_nocb_unlock_irqrestore(rdp, flags); >> wake_nocb_gp(rdp, false); > > Thanks.
> On Sep 24, 2022, at 7:28 PM, Joel Fernandes <joel@joelfernandes.org> wrote: > > Hi Frederic, thanks for the response, replies > below courtesy fruit company’s device: > >>> On Sep 24, 2022, at 6:46 PM, Frederic Weisbecker <frederic@kernel.org> wrote: >>> >>> On Thu, Sep 22, 2022 at 10:01:01PM +0000, Joel Fernandes (Google) wrote: >>> @@ -3902,7 +3939,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) >>> rdp->barrier_head.func = rcu_barrier_callback; >>> debug_rcu_head_queue(&rdp->barrier_head); >>> rcu_nocb_lock(rdp); >>> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); >>> + /* >>> + * Flush the bypass list, but also wake up the GP thread as otherwise >>> + * bypass/lazy CBs maynot be noticed, and can cause real long delays! >>> + */ >>> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE)); >> >> This fixes an issue that goes beyond lazy implementation. It should be done >> in a separate patch, handling rcu_segcblist_entrain() as well, with "Fixes: " tag. > > I wanted to do that, however on discussion with > Paul I thought of making this optimization only for > all lazy bypass CBs. That makes it directly related > this patch since the laziness notion is first > introduced here. On the other hand I could make > this change in a later patch since we are not > super bisectable anyway courtesy of the last > patch (which is not really an issue if the CONFIG > is kept off during someone’s bisection. Or are we saying it’s worth doing the wake up for rcu barrier even for regular bypass CB? That’d save 2 jiffies on rcu barrier. If we agree it’s needed, then yes splitting the patch makes sense. Please let me know your opinions, thanks, - Joel > >> And then FLUSH_BP_WAKE is probably not needed anymore. > > It is needed as the API is in tree_nocb.h and we > have to have that handle the details of laziness > there rather than tree.c. We could add new apis > to get rid of flag but it’s cleaner (and Paul seemed > to be ok with it). > >>> if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) { >>> atomic_inc(&rcu_state.barrier_cpu_count); >>> } else { >>> @@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, >>> raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags); >>> >>> /* >>> - * Bypass wakeup overrides previous deferments. In case >>> - * of callback storm, no need to wake up too early. >>> + * Bypass wakeup overrides previous deferments. In case of >>> + * callback storm, no need to wake up too early. >>> */ >>> - if (waketype == RCU_NOCB_WAKE_BYPASS) { >>> + if (waketype == RCU_NOCB_WAKE_LAZY >>> + && READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) { >> >> This can be a plain READ since ->nocb_defer_wakeup is only written under ->nocb_gp_lock. > > Yes makes sense, will do. > >>> + mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush); >>> + WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype); >>> + } else if (waketype == RCU_NOCB_WAKE_BYPASS) { >>> mod_timer(&rdp_gp->nocb_timer, jiffies + 2); >>> WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype); >>> } else { >>> @@ -512,9 +598,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, >>> } >>> // Need to actually to a wakeup. 
>>> len = rcu_segcblist_n_cbs(&rdp->cblist); >>> + bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass); >>> + lazy_len = READ_ONCE(rdp->lazy_len); >>> if (was_alldone) { >>> rdp->qlen_last_fqs_check = len; >>> - if (!irqs_disabled_flags(flags)) { >>> + // Only lazy CBs in bypass list >>> + if (lazy_len && bypass_len == lazy_len) { >>> + rcu_nocb_unlock_irqrestore(rdp, flags); >>> + wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY, >>> + TPS("WakeLazy")); >> >> I'm trying to think of a case where rcu_nocb_try_bypass() returns false >> (queue to regular list) but then call_rcu() -> __call_rcu_nocb_wake() ends up >> seeing a lazy bypass queue even though we are queueing a non-lazy callback >> (should have flushed in this case). >> >> Looks like it shouldn't happen, even with concurrent (de-offloading) but just >> in case, can we add: > > Yes I also feel this couldn’t happen because irq is > off and nocb lock is held throughout the calls to > the above 2 functions. Unless I missed the race > you’re describing? > >> >> WARN_ON_ONCE(lazy_len != len) > > But this condition can be true even in normal > circumstances? len also contains DONE CBs > which are ready to be invoked. Or did I miss > something? > > Thanks, > > - Joel > >> >>> + } else if (!irqs_disabled_flags(flags)) { >>> /* ... if queue was empty ... */ >>> rcu_nocb_unlock_irqrestore(rdp, flags); >>> wake_nocb_gp(rdp, false); >> >> Thanks.
> Implement timer-based RCU lazy callback batching. The batch is flushed > whenever a certain amount of time has passed, or the batch on a > particular CPU grows too big. Also memory pressure will flush it in a > future patch. > > To handle several corner cases automagically (such as rcu_barrier() and > hotplug), we re-use bypass lists to handle lazy CBs. The bypass list > length has the lazy CB length included in it. A separate lazy CB length > counter is also introduced to keep track of the number of lazy CBs. > > v5->v6: > > [ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other > deferral levels wake much earlier so for those it is not needed. ] > > [ Frederic Weisbec: Use flush flags to keep bypass API code clean. ] > > [ Frederic Weisbec: Make rcu_barrier() wake up only if main list empty. ] > > [ Frederic Weisbec: Remove extra 'else if' branch in rcu_nocb_try_bypass(). ] > > [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ] > > [ Paul/Thomas/Joel: Make call_rcu() default lazy so users don't mess up. ] > I think it make sense to add some data to the commit message illustrating what this patch does. From my side i gave a try of this patch on my setup. Some data: <snip> root@pc638:/home/urezki/rcu_v6# ./perf_script_parser ./perf_v6.script | sort -nk 6 | grep rcu name: rcuop/23 pid: 184 woken-up 1 interval: min 0 max 0 avg 0 name: rcuop/26 pid: 206 woken-up 1 interval: min 0 max 0 avg 0 name: rcuop/29 pid: 227 woken-up 1 interval: min 0 max 0 avg 0 name: rcuop/2 pid: 35 woken-up 1 interval: min 0 max 0 avg 0 name: rcuop/33 pid: 256 woken-up 1 interval: min 0 max 0 avg 0 name: rcuop/34 pid: 263 woken-up 1 interval: min 0 max 0 avg 0 name: rcuop/35 pid: 270 woken-up 1 interval: min 0 max 0 avg 0 name: rcuop/36 pid: 277 woken-up 1 interval: min 0 max 0 avg 0 name: rcuop/37 pid: 284 woken-up 1 interval: min 0 max 0 avg 0 name: rcuop/38 pid: 291 woken-up 1 interval: min 0 max 0 avg 0 name: rcuop/49 pid: 370 woken-up 1 interval: min 0 max 0 avg 0 name: rcuop/59 pid: 441 woken-up 1 interval: min 0 max 0 avg 0 name: rcuop/63 pid: 469 woken-up 1 interval: min 0 max 0 avg 0 name: rcuog/0 pid: 16 woken-up 2 interval: min 8034 max 8034 avg 4017 name: rcuog/24 pid: 191 woken-up 2 interval: min 7941 max 7941 avg 3970 name: rcuog/32 pid: 248 woken-up 2 interval: min 7542 max 7542 avg 3771 name: rcuog/48 pid: 362 woken-up 2 interval: min 8065 max 8065 avg 4032 name: rcuog/56 pid: 419 woken-up 2 interval: min 8076 max 8076 avg 4038 name: rcuop/21 pid: 170 woken-up 2 interval: min 13311438 max 13311438 avg 6655719 name: rcuog/16 pid: 134 woken-up 4 interval: min 8029 max 13303387 avg 3329863 name: rcuop/9 pid: 85 woken-up 4 interval: min 10007570 max 10007586 avg 7505684 name: rcuog/8 pid: 77 woken-up 8 interval: min 6240 max 10001242 avg 3753622 name: rcu_preempt pid: 15 woken-up 18 interval: min 6058 max 9999713 avg 2140788 name: test_rcu/0 pid: 1411 woken-up 10003 interval: min 165 max 19072 avg 4275 root@pc638:/home/urezki/rcu_v6# root@pc638:/home/urezki/rcu_v6# ./perf_script_parser ./perf_default.script | sort -nk 6 | grep rcu name: rcuop/33 pid: 256 woken-up 1 interval: min 0 max 0 avg 0 name: rcuop/51 pid: 384 woken-up 1 interval: min 0 max 0 avg 0 name: rcuog/32 pid: 248 woken-up 2 interval: min 11927 max 11927 avg 5963 name: rcuop/63 pid: 469 woken-up 2 interval: min 23963 max 23963 avg 11981 name: rcuog/56 pid: 419 woken-up 3 interval: min 11132 max 23967 avg 11699 name: rcuop/50 pid: 377 woken-up 3 interval: min 8057 max 4944344 avg 1650800 name: 
rcuog/48 pid: 362 woken-up 8 interval: min 2712 max 37430015 avg 5298801 name: rcuop/16 pid: 135 woken-up 4790 interval: min 7340 max 16649 avg 8843 name: rcuog/16 pid: 134 woken-up 4792 interval: min 7368 max 16644 avg 8844 name: rcu_preempt pid: 15 woken-up 5302 interval: min 26 max 12179 avg 7994 name: test_rcu/0 pid: 1353 woken-up 10003 interval: min 169 max 18508 avg 4236 root@pc638:/home/urezki/rcu_v6# <snip> so it is obvious that the patch does the job. On my KVM machine the boot time is affected: <snip> [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 [ 104.115418] process '/usr/bin/fstype' started with executable stack [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) [ 104.340193] systemd[1]: Detected virtualization kvm. [ 104.340196] systemd[1]: Detected architecture x86-64. [ 104.359032] systemd[1]: Set hostname to <pc638>. [ 105.740109] random: crng init done [ 105.741267] systemd[1]: Reached target Remote File Systems. <snip> 2 - 11 and second delay is between 32 - 104. So there are still users which must be waiting for "RCU" in a sync way. > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > index 08605ce7379d..40ae36904825 100644 > --- a/include/linux/rcupdate.h > +++ b/include/linux/rcupdate.h > @@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void) > > #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ > > +#ifdef CONFIG_RCU_LAZY > +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func); > +#else > +static inline void call_rcu_flush(struct rcu_head *head, > + rcu_callback_t func) { call_rcu(head, func); } > +#endif > + > /* Internal to kernel */ > void rcu_init(void); > extern int rcu_scheduler_active; > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > index f53ad63b2bc6..edd632e68497 100644 > --- a/kernel/rcu/Kconfig > +++ b/kernel/rcu/Kconfig > @@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB > Say N here if you hate read-side memory barriers. > Take the default if you are unsure. > > +config RCU_LAZY > + bool "RCU callback lazy invocation functionality" > + depends on RCU_NOCB_CPU > + default n > + help > + To save power, batch RCU callbacks and flush after delay, memory > + pressure or callback list growing too big. > + > Do you think you need this kernel option? Can we just consider and make it a run-time configurable? For example much more users will give it a try, so it will increase a coverage. By default it can be off. Also you do not need to do: #ifdef LAZY ... #else ... #endif > > +/* > + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that > + * can elapse before lazy callbacks are flushed. Lazy callbacks > + * could be flushed much earlier for a number of other reasons > + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are > + * left unsubmitted to RCU after those many jiffies. > + */ > +#define LAZY_FLUSH_JIFFIES (10 * HZ) > +static unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES; Make it configurable? I do not think you actually need 10 seconds here. 
Reducing it would also reduce the possibility of hitting a low-memory condition. One second would be far enough, I think. -- Uladzislau Rezki
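On the tunability point, a minimal sketch of a runtime-settable flush delay (and a laziness switch) might look like the snippet below. This is only an illustration with made-up parameter names, not part of the posted series, which keeps LAZY_FLUSH_JIFFIES as a compile-time default plus test-only setters:

#include <linux/moduleparam.h>

/* As in the patch: upper bound before lazy callbacks are flushed. */
#define LAZY_FLUSH_JIFFIES (10 * HZ)

/*
 * Illustrative only: allow the delay to be set at boot or via sysfs
 * (e.g. rcutree.rcu_lazy_flush_jiffies=...); the name is hypothetical.
 */
static unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
module_param_named(rcu_lazy_flush_jiffies, jiffies_till_flush, ulong, 0644);

/* A boolean could gate laziness at run time the same way (also hypothetical). */
static bool rcu_lazy_enabled = IS_ENABLED(CONFIG_RCU_LAZY);
module_param_named(rcu_lazy, rcu_lazy_enabled, bool, 0444);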
Hi Paul, Back to Mutt for this one ;-) Replies below: On Sat, Sep 24, 2022 at 02:11:32PM -0700, Paul E. McKenney wrote: [...] > > >> + */ > > >> + if (lazy && rhp) { > > >> + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL); > > >> + rcu_cblist_enqueue(&rcl, rhp); > > > > > > Would it makes sense to enqueue rhp onto ->nocb_bypass first, NULL out > > > rhp, then let the rcu_cblist_flush_enqueue() be common code? Or did this > > > function grow a later use of rhp that I missed? > > > > No that could be done, but it prefer to keep it this > > way because rhp is a function parameter and I > > prefer not to modify those since it could add a > > bug in future where rhp passed by user is now > > NULL for some reason, half way through the > > function. > > I agree that changing a function parameter is bad practice. > > So the question becomes whether introducing a local would outweigh > consolidating this code. Could you please at least give it a shot? > > > >> + WRITE_ONCE(rdp->lazy_len, 0); > > >> + } else { > > >> + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > > >> + WRITE_ONCE(rdp->lazy_len, 0); > > > > > > This WRITE_ONCE() can be dropped out of the "if" statement, correct? > > > > Yes will update. > > Thank you! > > > > If so, this could be an "if" statement with two statements in its "then" > > > clause, no "else" clause, and two statements following the "if" statement. > > > > I don’t think we can get rid of the else part but I’ll see what it looks like. > > In the function header, s/rhp/rhp_in/, then: > > struct rcu_head *rhp = rhp_in; > > And then: > > if (lazy && rhp) { > rcu_cblist_enqueue(&rdp->nocb_bypass, rhp); > rhp = NULL; This enqueues on to the bypass list, where as if lazy && rhp, I want to queue the new rhp on to the main cblist. So the pseudo code in my patch is: if (lazy and rhp) then 1. flush bypass CBs on to main list. 2. queue new CB on to main list. else 1. flush bypass CBs on to main list 2. queue new CB on to bypass list. > } > rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > WRITE_ONCE(rdp->lazy_len, 0); > > Or did I mess something up? So the rcu_cblist_flush_enqueue() has to happen before the rcu_cblist_enqueue() to preserve the ordering of flushing into the main list, and queuing on to the main list for the "if". Where as in your snip, the order is reversed. If I consolidate it then, it looks like the following. However, it is a bit more unreadable. I could instead just take the WRITE_ONCE out of both if/else and move it to after the if/else, that would be cleanest. Does that sound good to you? Thanks! ---8<----------------------- diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h index 1a182b9c4f6c..bd3f54d314e8 100644 --- a/kernel/rcu/tree_nocb.h +++ b/kernel/rcu/tree_nocb.h @@ -327,10 +327,11 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, * * Note that this function always returns true if rhp is NULL. */ -static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, +static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp_in, unsigned long j, unsigned long flush_flags) { struct rcu_cblist rcl; + struct rcu_head *rhp = rhp_in; bool lazy = flush_flags & FLUSH_BP_LAZY; WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp)); @@ -348,14 +349,13 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, * If the new CB requested was a lazy one, queue it onto the main * ->cblist so we can take advantage of a sooner grade period. 
*/ - if (lazy && rhp) { - rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL); - rcu_cblist_enqueue(&rcl, rhp); - WRITE_ONCE(rdp->lazy_len, 0); - } else { - rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); - WRITE_ONCE(rdp->lazy_len, 0); - } + if (lazy && rhp) + rhp = NULL; + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); + if (lazy && rhp_in) + rcu_cblist_enqueue(&rcl, rhp_in); + + WRITE_ONCE(rdp->lazy_len, 0); rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl); WRITE_ONCE(rdp->nocb_bypass_first, j);
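For comparison, the variant called "cleanest" above (keep both enqueue orderings from the patch and only hoist the common WRITE_ONCE() out of the if/else) would be roughly:

	/*
	 * Sketch: the flush-then-queue order is preserved for the lazy
	 * case, and the lazy_len reset is done once for both branches.
	 */
	if (lazy && rhp) {
		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL);
		rcu_cblist_enqueue(&rcl, rhp);
	} else {
		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
	}
	WRITE_ONCE(rdp->lazy_len, 0);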
Hi Vlad, On Sun, Sep 25, 2022 at 10:57:10AM +0200, Uladzislau Rezki wrote: > > Implement timer-based RCU lazy callback batching. The batch is flushed > > whenever a certain amount of time has passed, or the batch on a > > particular CPU grows too big. Also memory pressure will flush it in a > > future patch. > > > > To handle several corner cases automagically (such as rcu_barrier() and > > hotplug), we re-use bypass lists to handle lazy CBs. The bypass list > > length has the lazy CB length included in it. A separate lazy CB length > > counter is also introduced to keep track of the number of lazy CBs. > > > > v5->v6: > > > > [ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other > > deferral levels wake much earlier so for those it is not needed. ] > > > > [ Frederic Weisbec: Use flush flags to keep bypass API code clean. ] > > > > [ Frederic Weisbec: Make rcu_barrier() wake up only if main list empty. ] > > > > [ Frederic Weisbec: Remove extra 'else if' branch in rcu_nocb_try_bypass(). ] > > > > [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ] > > > > [ Paul/Thomas/Joel: Make call_rcu() default lazy so users don't mess up. ] > > > I think it make sense to add some data to the commit message > illustrating what this patch does. Sure, will do! > From my side i gave a try of this patch on my setup. Some data: > > <snip> > root@pc638:/home/urezki/rcu_v6# ./perf_script_parser ./perf_v6.script | sort -nk 6 | grep rcu > name: rcuop/23 pid: 184 woken-up 1 interval: min 0 max 0 avg 0 > name: rcuop/26 pid: 206 woken-up 1 interval: min 0 max 0 avg 0 > name: rcuop/29 pid: 227 woken-up 1 interval: min 0 max 0 avg 0 > name: rcuop/2 pid: 35 woken-up 1 interval: min 0 max 0 avg 0 > name: rcuop/33 pid: 256 woken-up 1 interval: min 0 max 0 avg 0 > name: rcuop/34 pid: 263 woken-up 1 interval: min 0 max 0 avg 0 > name: rcuop/35 pid: 270 woken-up 1 interval: min 0 max 0 avg 0 > name: rcuop/36 pid: 277 woken-up 1 interval: min 0 max 0 avg 0 > name: rcuop/37 pid: 284 woken-up 1 interval: min 0 max 0 avg 0 > name: rcuop/38 pid: 291 woken-up 1 interval: min 0 max 0 avg 0 > name: rcuop/49 pid: 370 woken-up 1 interval: min 0 max 0 avg 0 > name: rcuop/59 pid: 441 woken-up 1 interval: min 0 max 0 avg 0 > name: rcuop/63 pid: 469 woken-up 1 interval: min 0 max 0 avg 0 > name: rcuog/0 pid: 16 woken-up 2 interval: min 8034 max 8034 avg 4017 > name: rcuog/24 pid: 191 woken-up 2 interval: min 7941 max 7941 avg 3970 > name: rcuog/32 pid: 248 woken-up 2 interval: min 7542 max 7542 avg 3771 > name: rcuog/48 pid: 362 woken-up 2 interval: min 8065 max 8065 avg 4032 > name: rcuog/56 pid: 419 woken-up 2 interval: min 8076 max 8076 avg 4038 > name: rcuop/21 pid: 170 woken-up 2 interval: min 13311438 max 13311438 avg 6655719 > name: rcuog/16 pid: 134 woken-up 4 interval: min 8029 max 13303387 avg 3329863 > name: rcuop/9 pid: 85 woken-up 4 interval: min 10007570 max 10007586 avg 7505684 > name: rcuog/8 pid: 77 woken-up 8 interval: min 6240 max 10001242 avg 3753622 > name: rcu_preempt pid: 15 woken-up 18 interval: min 6058 max 9999713 avg 2140788 > name: test_rcu/0 pid: 1411 woken-up 10003 interval: min 165 max 19072 avg 4275 > root@pc638:/home/urezki/rcu_v6# > > root@pc638:/home/urezki/rcu_v6# ./perf_script_parser ./perf_default.script | sort -nk 6 | grep rcu > name: rcuop/33 pid: 256 woken-up 1 interval: min 0 max 0 avg 0 > name: rcuop/51 pid: 384 woken-up 1 interval: min 0 max 0 avg 0 > name: rcuog/32 pid: 248 woken-up 2 interval: min 11927 max 11927 avg 5963 > name: rcuop/63 pid: 
469 woken-up 2 interval: min 23963 max 23963 avg 11981 > name: rcuog/56 pid: 419 woken-up 3 interval: min 11132 max 23967 avg 11699 > name: rcuop/50 pid: 377 woken-up 3 interval: min 8057 max 4944344 avg 1650800 > name: rcuog/48 pid: 362 woken-up 8 interval: min 2712 max 37430015 avg 5298801 > name: rcuop/16 pid: 135 woken-up 4790 interval: min 7340 max 16649 avg 8843 > name: rcuog/16 pid: 134 woken-up 4792 interval: min 7368 max 16644 avg 8844 > name: rcu_preempt pid: 15 woken-up 5302 interval: min 26 max 12179 avg 7994 > name: test_rcu/0 pid: 1353 woken-up 10003 interval: min 169 max 18508 avg 4236 > root@pc638:/home/urezki/rcu_v6# > <snip> > > so it is obvious that the patch does the job. Thanks a lot for testing! > On my KVM machine the boot time is affected: > > <snip> > [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > [ 104.115418] process '/usr/bin/fstype' started with executable stack > [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > [ 104.340193] systemd[1]: Detected virtualization kvm. > [ 104.340196] systemd[1]: Detected architecture x86-64. > [ 104.359032] systemd[1]: Set hostname to <pc638>. > [ 105.740109] random: crng init done > [ 105.741267] systemd[1]: Reached target Remote File Systems. > <snip> > > 2 - 11 and second delay is between 32 - 104. So there are still users which must > be waiting for "RCU" in a sync way. I was wondering if you can compare boot logs and see which timestamp does the slow down start from. That way, we can narrow down the callback. Also another idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback ftrace_dump_on_oops" to the boot params, and then manually call "tracing_off(); panic();" from the code at the first printk that seems off in your comparison of good vs bad. For example, if "crng init done" timestamp is off, put the "tracing_off(); panic();" there. Then grab the serial console output to see what were the last callbacks that was queued/invoked. > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > index 08605ce7379d..40ae36904825 100644 > > --- a/include/linux/rcupdate.h > > +++ b/include/linux/rcupdate.h > > @@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void) > > > > #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ > > > > +#ifdef CONFIG_RCU_LAZY > > +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func); > > +#else > > +static inline void call_rcu_flush(struct rcu_head *head, > > + rcu_callback_t func) { call_rcu(head, func); } > > +#endif > > + > > /* Internal to kernel */ > > void rcu_init(void); > > extern int rcu_scheduler_active; > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > > index f53ad63b2bc6..edd632e68497 100644 > > --- a/kernel/rcu/Kconfig > > +++ b/kernel/rcu/Kconfig > > @@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB > > Say N here if you hate read-side memory barriers. > > Take the default if you are unsure. 
> > > > +config RCU_LAZY > > + bool "RCU callback lazy invocation functionality" > > + depends on RCU_NOCB_CPU > > + default n > > + help > > + To save power, batch RCU callbacks and flush after delay, memory > > + pressure or callback list growing too big. > > + > > > Do you think you need this kernel option? Can we just consider and make > it a run-time configurable? For example much more users will give it a try, > so it will increase a coverage. By default it can be off. > > Also you do not need to do: > > #ifdef LAZY How does the "LAZY" macro end up being runtime-configurable? That's static / compile time. Did I miss something? > ... > #else > ... > #endif > > > > > +/* > > + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that > > + * can elapse before lazy callbacks are flushed. Lazy callbacks > > + * could be flushed much earlier for a number of other reasons > > + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are > > + * left unsubmitted to RCU after those many jiffies. > > + */ > > +#define LAZY_FLUSH_JIFFIES (10 * HZ) > > +static unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES; > Make it configurable? I do not think you actually need 10 seconds here. > Reducing it will reduce a possibility to hit a low memory condition. 1 > second would be far enough i think. Hmm, I can make the delay configurable but for now I'll keep this as default as all of our power testing has been done with that and I don't want risk losing the optimization. Honestly, I am not worried too about memory pressure as we have a shrinker which triggers flushes on the slightest hint of memory pressure. If it is not handling it properly, then we need to fix the shrinker. thanks, - Joel
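To make the debugging approach described above concrete, a hypothetical hook at the first suspicious boot message could look like this (the message and placement are examples only):

	/*
	 * Boot with trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback
	 * and ftrace_dump_on_oops, then freeze the trace buffer and panic
	 * right where the boot log stalls, so the dump shows the last
	 * callbacks queued and invoked.
	 */
	pr_notice("crng init done\n");	/* the suspiciously late message */
	tracing_off();			/* stop overwriting the ring buffer */
	panic("rcu-lazy boot delay debug dump");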
On Sat, Sep 24, 2022 at 09:00:39PM -0400, Joel Fernandes wrote: > > > > On Sep 24, 2022, at 7:28 PM, Joel Fernandes <joel@joelfernandes.org> wrote: > > > > Hi Frederic, thanks for the response, replies > > below courtesy fruit company’s device: > > > >>> On Sep 24, 2022, at 6:46 PM, Frederic Weisbecker <frederic@kernel.org> wrote: > >>> > >>> On Thu, Sep 22, 2022 at 10:01:01PM +0000, Joel Fernandes (Google) wrote: > >>> @@ -3902,7 +3939,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) > >>> rdp->barrier_head.func = rcu_barrier_callback; > >>> debug_rcu_head_queue(&rdp->barrier_head); > >>> rcu_nocb_lock(rdp); > >>> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); > >>> + /* > >>> + * Flush the bypass list, but also wake up the GP thread as otherwise > >>> + * bypass/lazy CBs maynot be noticed, and can cause real long delays! > >>> + */ > >>> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE)); > >> > >> This fixes an issue that goes beyond lazy implementation. It should be done > >> in a separate patch, handling rcu_segcblist_entrain() as well, with "Fixes: " tag. > > > > I wanted to do that, however on discussion with > > Paul I thought of making this optimization only for > > all lazy bypass CBs. That makes it directly related > > this patch since the laziness notion is first > > introduced here. On the other hand I could make > > this change in a later patch since we are not > > super bisectable anyway courtesy of the last > > patch (which is not really an issue if the CONFIG > > is kept off during someone’s bisection. > > Or are we saying it’s worth doing the wake up for rcu barrier even for regular bypass CB? That’d save 2 jiffies on rcu barrier. If we agree it’s needed, then yes splitting the patch makes sense. > > Please let me know your opinions, thanks, > > - Joel Sure, I mean since we are fixing the buggy rcu_barrier_entrain() anyway, let's just fix bypass as well. 
Such as in the following (untested): diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index b39e97175a9e..a0df964abb0e 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3834,6 +3834,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) { unsigned long gseq = READ_ONCE(rcu_state.barrier_sequence); unsigned long lseq = READ_ONCE(rdp->barrier_seq_snap); + bool wake_nocb = false; + bool was_alldone = false; lockdep_assert_held(&rcu_state.barrier_lock); if (rcu_seq_state(lseq) || !rcu_seq_state(gseq) || rcu_seq_ctr(lseq) != rcu_seq_ctr(gseq)) @@ -3842,6 +3844,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) rdp->barrier_head.func = rcu_barrier_callback; debug_rcu_head_queue(&rdp->barrier_head); rcu_nocb_lock(rdp); + if (rcu_rdp_is_offloaded(rdp) && !rcu_segcblist_pend_cbs(&rdp->cblist)) + was_alldone = true; WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) { atomic_inc(&rcu_state.barrier_cpu_count); @@ -3849,7 +3853,12 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) debug_rcu_head_unqueue(&rdp->barrier_head); rcu_barrier_trace(TPS("IRQNQ"), -1, rcu_state.barrier_sequence); } + if (was_alldone && rcu_segcblist_pend_cbs(&rdp->cblist)) + wake_nocb = true; rcu_nocb_unlock(rdp); + if (wake_nocb) + wake_nocb_gp(rdp, false); + smp_store_release(&rdp->barrier_seq_snap, gseq); } diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index d4a97e40ea9c..925dd98f8b23 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -439,6 +439,7 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp); static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp); static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq); static void rcu_init_one_nocb(struct rcu_node *rnp); +static bool wake_nocb_gp(struct rcu_data *rdp, bool force); static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, unsigned long j); static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h index 538a0ed93946..e1701aa9c82c 100644 --- a/kernel/rcu/tree_nocb.h +++ b/kernel/rcu/tree_nocb.h @@ -1600,6 +1600,10 @@ static void rcu_init_one_nocb(struct rcu_node *rnp) { } +static bool wake_nocb_gp(struct rcu_data *rdp, bool force) +{ +} + static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, unsigned long j) {
On Sat, Sep 24, 2022 at 07:28:16PM -0400, Joel Fernandes wrote: > > And then FLUSH_BP_WAKE is probably not needed anymore. > > It is needed as the API is in tree_nocb.h and we > have to have that handle the details of laziness > there rather than tree.c. We could add new apis > to get rid of flag but it’s cleaner (and Paul seemed > to be ok with it). If the wake up is handled outside the flush function, as in the diff I just posted, there is no more user left of FLUSH_BP_WAKE, IIRC... > >> @@ -512,9 +598,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, > >> } > >> // Need to actually to a wakeup. > >> len = rcu_segcblist_n_cbs(&rdp->cblist); > >> + bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass); > >> + lazy_len = READ_ONCE(rdp->lazy_len); > >> if (was_alldone) { > >> rdp->qlen_last_fqs_check = len; > >> - if (!irqs_disabled_flags(flags)) { > >> + // Only lazy CBs in bypass list > >> + if (lazy_len && bypass_len == lazy_len) { > >> + rcu_nocb_unlock_irqrestore(rdp, flags); > >> + wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY, > >> + TPS("WakeLazy")); > > > > I'm trying to think of a case where rcu_nocb_try_bypass() returns false > > (queue to regular list) but then call_rcu() -> __call_rcu_nocb_wake() ends up > > seeing a lazy bypass queue even though we are queueing a non-lazy callback > > (should have flushed in this case). > > > > Looks like it shouldn't happen, even with concurrent (de-offloading) but just > > in case, can we add: > > Yes I also feel this couldn’t happen because irq is > off and nocb lock is held throughout the calls to > the above 2 functions. Unless I missed the race > you’re describing? At least I can't find any either... > > > > > WARN_ON_ONCE(lazy_len != len) > > But this condition can be true even in normal > circumstances? len also contains DONE CBs > which are ready to be invoked. Or did I miss > something? Duh, good point, nevermind then :-) Thanks. > > Thanks, > > - Joel > > > > >> + } else if (!irqs_disabled_flags(flags)) { > >> /* ... if queue was empty ... */ > >> rcu_nocb_unlock_irqrestore(rdp, flags); > >> wake_nocb_gp(rdp, false); > > > > Thanks.
On Mon, Sep 26, 2022 at 12:00:45AM +0200, Frederic Weisbecker wrote: > On Sat, Sep 24, 2022 at 09:00:39PM -0400, Joel Fernandes wrote: > > > > > > > On Sep 24, 2022, at 7:28 PM, Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > Hi Frederic, thanks for the response, replies > > > below courtesy fruit company’s device: > > > > > >>> On Sep 24, 2022, at 6:46 PM, Frederic Weisbecker <frederic@kernel.org> wrote: > > >>> > > >>> On Thu, Sep 22, 2022 at 10:01:01PM +0000, Joel Fernandes (Google) wrote: > > >>> @@ -3902,7 +3939,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) > > >>> rdp->barrier_head.func = rcu_barrier_callback; > > >>> debug_rcu_head_queue(&rdp->barrier_head); > > >>> rcu_nocb_lock(rdp); > > >>> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); > > >>> + /* > > >>> + * Flush the bypass list, but also wake up the GP thread as otherwise > > >>> + * bypass/lazy CBs maynot be noticed, and can cause real long delays! > > >>> + */ > > >>> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE)); > > >> > > >> This fixes an issue that goes beyond lazy implementation. It should be done > > >> in a separate patch, handling rcu_segcblist_entrain() as well, with "Fixes: " tag. > > > > > > I wanted to do that, however on discussion with > > > Paul I thought of making this optimization only for > > > all lazy bypass CBs. That makes it directly related > > > this patch since the laziness notion is first > > > introduced here. On the other hand I could make > > > this change in a later patch since we are not > > > super bisectable anyway courtesy of the last > > > patch (which is not really an issue if the CONFIG > > > is kept off during someone’s bisection. > > > > Or are we saying it’s worth doing the wake up for rcu barrier even for > > regular bypass CB? That’d save 2 jiffies on rcu barrier. If we agree it’s > > needed, then yes splitting the patch makes sense. > > > > Please let me know your opinions, thanks, > > > > - Joel > > Sure, I mean since we are fixing the buggy rcu_barrier_entrain() anyway, let's > just fix bypass as well. Such as in the following (untested): Got it. This sounds good to me, and will simplify the code a bit more for sure. I guess a question for Paul - are you Ok with rcu_barrier() causing wake ups if the bypass list has any non-lazy CBs as well? That should be OK, IMO. 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index b39e97175a9e..a0df964abb0e 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -3834,6 +3834,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) > { > unsigned long gseq = READ_ONCE(rcu_state.barrier_sequence); > unsigned long lseq = READ_ONCE(rdp->barrier_seq_snap); > + bool wake_nocb = false; > + bool was_alldone = false; > > lockdep_assert_held(&rcu_state.barrier_lock); > if (rcu_seq_state(lseq) || !rcu_seq_state(gseq) || rcu_seq_ctr(lseq) != rcu_seq_ctr(gseq)) > @@ -3842,6 +3844,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) > rdp->barrier_head.func = rcu_barrier_callback; > debug_rcu_head_queue(&rdp->barrier_head); > rcu_nocb_lock(rdp); > + if (rcu_rdp_is_offloaded(rdp) && !rcu_segcblist_pend_cbs(&rdp->cblist)) > + was_alldone = true; > WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); > if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) { > atomic_inc(&rcu_state.barrier_cpu_count); > @@ -3849,7 +3853,12 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) > debug_rcu_head_unqueue(&rdp->barrier_head); > rcu_barrier_trace(TPS("IRQNQ"), -1, rcu_state.barrier_sequence); > } > + if (was_alldone && rcu_segcblist_pend_cbs(&rdp->cblist)) > + wake_nocb = true; > rcu_nocb_unlock(rdp); > + if (wake_nocb) > + wake_nocb_gp(rdp, false); > + Thanks for the code snippet, I like how you are checking if the bypass list is empty, without actually checking it ;-) thanks, - Joel
On Mon, Sep 26, 2022 at 03:04:38PM +0000, Joel Fernandes wrote: > On Mon, Sep 26, 2022 at 12:00:45AM +0200, Frederic Weisbecker wrote: > > On Sat, Sep 24, 2022 at 09:00:39PM -0400, Joel Fernandes wrote: > > > > > > > > > > On Sep 24, 2022, at 7:28 PM, Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > Hi Frederic, thanks for the response, replies > > > > below courtesy fruit company’s device: > > > > > > > >>> On Sep 24, 2022, at 6:46 PM, Frederic Weisbecker <frederic@kernel.org> wrote: > > > >>> > > > >>> On Thu, Sep 22, 2022 at 10:01:01PM +0000, Joel Fernandes (Google) wrote: > > > >>> @@ -3902,7 +3939,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) > > > >>> rdp->barrier_head.func = rcu_barrier_callback; > > > >>> debug_rcu_head_queue(&rdp->barrier_head); > > > >>> rcu_nocb_lock(rdp); > > > >>> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); > > > >>> + /* > > > >>> + * Flush the bypass list, but also wake up the GP thread as otherwise > > > >>> + * bypass/lazy CBs maynot be noticed, and can cause real long delays! > > > >>> + */ > > > >>> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE)); > > > >> > > > >> This fixes an issue that goes beyond lazy implementation. It should be done > > > >> in a separate patch, handling rcu_segcblist_entrain() as well, with "Fixes: " tag. > > > > > > > > I wanted to do that, however on discussion with > > > > Paul I thought of making this optimization only for > > > > all lazy bypass CBs. That makes it directly related > > > > this patch since the laziness notion is first > > > > introduced here. On the other hand I could make > > > > this change in a later patch since we are not > > > > super bisectable anyway courtesy of the last > > > > patch (which is not really an issue if the CONFIG > > > > is kept off during someone’s bisection. > > > > > > Or are we saying it’s worth doing the wake up for rcu barrier even for > > > regular bypass CB? That’d save 2 jiffies on rcu barrier. If we agree it’s > > > needed, then yes splitting the patch makes sense. > > > > > > Please let me know your opinions, thanks, > > > > > > - Joel > > > > Sure, I mean since we are fixing the buggy rcu_barrier_entrain() anyway, let's > > just fix bypass as well. Such as in the following (untested): > > Got it. This sounds good to me, and will simplify the code a bit more for sure. > > I guess a question for Paul - are you Ok with rcu_barrier() causing wake ups > if the bypass list has any non-lazy CBs as well? That should be OK, IMO. In theory, I am OK with it. In practice, you are the guys with the hardware that can measure power consumption, not me! 
;-) > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > index b39e97175a9e..a0df964abb0e 100644 > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -3834,6 +3834,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) > > { > > unsigned long gseq = READ_ONCE(rcu_state.barrier_sequence); > > unsigned long lseq = READ_ONCE(rdp->barrier_seq_snap); > > + bool wake_nocb = false; > > + bool was_alldone = false; > > > > lockdep_assert_held(&rcu_state.barrier_lock); > > if (rcu_seq_state(lseq) || !rcu_seq_state(gseq) || rcu_seq_ctr(lseq) != rcu_seq_ctr(gseq)) > > @@ -3842,6 +3844,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) > > rdp->barrier_head.func = rcu_barrier_callback; > > debug_rcu_head_queue(&rdp->barrier_head); > > rcu_nocb_lock(rdp); > > + if (rcu_rdp_is_offloaded(rdp) && !rcu_segcblist_pend_cbs(&rdp->cblist)) > > + was_alldone = true; > > WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); > > if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) { > > atomic_inc(&rcu_state.barrier_cpu_count); > > @@ -3849,7 +3853,12 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) > > debug_rcu_head_unqueue(&rdp->barrier_head); > > rcu_barrier_trace(TPS("IRQNQ"), -1, rcu_state.barrier_sequence); > > } > > + if (was_alldone && rcu_segcblist_pend_cbs(&rdp->cblist)) > > + wake_nocb = true; > > rcu_nocb_unlock(rdp); > > + if (wake_nocb) > > + wake_nocb_gp(rdp, false); > > + > > Thanks for the code snippet, I like how you are checking if the bypass list > is empty, without actually checking it ;-) That certainly is consistent with the RCU philosophy. :-) Thanx, Paul
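
For reference, a condensed (non-diff) view of the check being praised here, lifted from Frederic's sketch above:

	/* Before flushing/entraining: no pending CBs on this offloaded CPU? */
	was_alldone = rcu_rdp_is_offloaded(rdp) &&
		      !rcu_segcblist_pend_cbs(&rdp->cblist);

	/* ... rcu_nocb_flush_bypass() + rcu_segcblist_entrain() ... */

	/*
	 * If pending CBs have appeared, they came from the bypass flush
	 * and/or the entrained barrier callback, so the GP kthread needs
	 * a wakeup -- and the bypass list never had to be inspected.
	 */
	wake_nocb = was_alldone && rcu_segcblist_pend_cbs(&rdp->cblist);
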
On Sun, Sep 25, 2022 at 05:31:13PM +0000, Joel Fernandes wrote: > Hi Paul, > > Back to Mutt for this one ;-) As they used to say in my youth, "Get a horse!" ;-) > Replies below: > > On Sat, Sep 24, 2022 at 02:11:32PM -0700, Paul E. McKenney wrote: > [...] > > > >> + */ > > > >> + if (lazy && rhp) { > > > >> + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL); > > > >> + rcu_cblist_enqueue(&rcl, rhp); > > > > > > > > Would it makes sense to enqueue rhp onto ->nocb_bypass first, NULL out > > > > rhp, then let the rcu_cblist_flush_enqueue() be common code? Or did this > > > > function grow a later use of rhp that I missed? > > > > > > No that could be done, but it prefer to keep it this > > > way because rhp is a function parameter and I > > > prefer not to modify those since it could add a > > > bug in future where rhp passed by user is now > > > NULL for some reason, half way through the > > > function. > > > > I agree that changing a function parameter is bad practice. > > > > So the question becomes whether introducing a local would outweigh > > consolidating this code. Could you please at least give it a shot? > > > > > >> + WRITE_ONCE(rdp->lazy_len, 0); > > > >> + } else { > > > >> + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > > > >> + WRITE_ONCE(rdp->lazy_len, 0); > > > > > > > > This WRITE_ONCE() can be dropped out of the "if" statement, correct? > > > > > > Yes will update. > > > > Thank you! > > > > > > If so, this could be an "if" statement with two statements in its "then" > > > > clause, no "else" clause, and two statements following the "if" statement. > > > > > > I don’t think we can get rid of the else part but I’ll see what it looks like. > > > > In the function header, s/rhp/rhp_in/, then: > > > > struct rcu_head *rhp = rhp_in; > > > > And then: > > > > if (lazy && rhp) { > > rcu_cblist_enqueue(&rdp->nocb_bypass, rhp); > > rhp = NULL; > > This enqueues on to the bypass list, where as if lazy && rhp, I want to queue > the new rhp on to the main cblist. So the pseudo code in my patch is: > > if (lazy and rhp) then > 1. flush bypass CBs on to main list. > 2. queue new CB on to main list. And the difference is here, correct? I enqueue to the bypass list, which is then flushed (in order) to the main list. In contrast, you flush the bypass list, then enqueue to the main list. Either way, the callback referenced by rhp ends up at the end of ->cblist. Or am I on the wrong branch of this "if" statement? > else > 1. flush bypass CBs on to main list > 2. queue new CB on to bypass list. > > > } > > rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > > WRITE_ONCE(rdp->lazy_len, 0); > > > > Or did I mess something up? > > So the rcu_cblist_flush_enqueue() has to happen before the > rcu_cblist_enqueue() to preserve the ordering of flushing into the main list, > and queuing on to the main list for the "if". Where as in your snip, the > order is reversed. Did I pick the correct branch of the "if" statement above? Or were you instead talking about the "else" clause? I would have been more worried about getting cblist->len right. > If I consolidate it then, it looks like the following. However, it is a bit > more unreadable. I could instead just take the WRITE_ONCE out of both if/else > and move it to after the if/else, that would be cleanest. Does that sound > good to you? Thanks! Let's first figure out whether or not we are talking past one another. 
;-) Thanx, Paul > ---8<----------------------- > > diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h > index 1a182b9c4f6c..bd3f54d314e8 100644 > --- a/kernel/rcu/tree_nocb.h > +++ b/kernel/rcu/tree_nocb.h > @@ -327,10 +327,11 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, > * > * Note that this function always returns true if rhp is NULL. > */ > -static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > +static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp_in, > unsigned long j, unsigned long flush_flags) > { > struct rcu_cblist rcl; > + struct rcu_head *rhp = rhp_in; > bool lazy = flush_flags & FLUSH_BP_LAZY; > > WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp)); > @@ -348,14 +349,13 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > * If the new CB requested was a lazy one, queue it onto the main > * ->cblist so we can take advantage of a sooner grade period. > */ > - if (lazy && rhp) { > - rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL); > - rcu_cblist_enqueue(&rcl, rhp); > - WRITE_ONCE(rdp->lazy_len, 0); > - } else { > - rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > - WRITE_ONCE(rdp->lazy_len, 0); > - } > + if (lazy && rhp) > + rhp = NULL; > + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > + if (lazy && rhp_in) > + rcu_cblist_enqueue(&rcl, rhp_in); > + > + WRITE_ONCE(rdp->lazy_len, 0); > > rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl); > WRITE_ONCE(rdp->nocb_bypass_first, j);
On Mon, Sep 26, 2022 at 12:09:36AM +0200, Frederic Weisbecker wrote: > On Sat, Sep 24, 2022 at 07:28:16PM -0400, Joel Fernandes wrote: > > > And then FLUSH_BP_WAKE is probably not needed anymore. > > > > It is needed as the API is in tree_nocb.h and we > > have to have that handle the details of laziness > > there rather than tree.c. We could add new apis > > to get rid of flag but it’s cleaner (and Paul seemed > > to be ok with it). > > If the wake up is handled outside the flush function, as in the > diff I just posted, there is no more user left of FLUSH_BP_WAKE, IIRC... To get rid of FLUSH_BP_WAKE, we might need to pull some rcu_data fields out from under #ifdef in order to allow them to be accessed by common code. Which might be a good tradeoff, as the size of rcu_data has not been a concern. Plus the increase in size would be quite small. Thanx, Paul > > >> @@ -512,9 +598,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, > > >> } > > >> // Need to actually to a wakeup. > > >> len = rcu_segcblist_n_cbs(&rdp->cblist); > > >> + bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass); > > >> + lazy_len = READ_ONCE(rdp->lazy_len); > > >> if (was_alldone) { > > >> rdp->qlen_last_fqs_check = len; > > >> - if (!irqs_disabled_flags(flags)) { > > >> + // Only lazy CBs in bypass list > > >> + if (lazy_len && bypass_len == lazy_len) { > > >> + rcu_nocb_unlock_irqrestore(rdp, flags); > > >> + wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY, > > >> + TPS("WakeLazy")); > > > > > > I'm trying to think of a case where rcu_nocb_try_bypass() returns false > > > (queue to regular list) but then call_rcu() -> __call_rcu_nocb_wake() ends up > > > seeing a lazy bypass queue even though we are queueing a non-lazy callback > > > (should have flushed in this case). > > > > > > Looks like it shouldn't happen, even with concurrent (de-offloading) but just > > > in case, can we add: > > > > Yes I also feel this couldn’t happen because irq is > > off and nocb lock is held throughout the calls to > > the above 2 functions. Unless I missed the race > > you’re describing? > > At least I can't find any either... > > > > > > > > > WARN_ON_ONCE(lazy_len != len) > > > > But this condition can be true even in normal > > circumstances? len also contains DONE CBs > > which are ready to be invoked. Or did I miss > > something? > > Duh, good point, nevermind then :-) > > Thanks. > > > > > Thanks, > > > > - Joel > > > > > > > >> + } else if (!irqs_disabled_flags(flags)) { > > >> /* ... if queue was empty ... */ > > >> rcu_nocb_unlock_irqrestore(rdp, flags); > > >> wake_nocb_gp(rdp, false); > > > > > > Thanks.
On Sun, Sep 25, 2022 at 05:46:53PM +0000, Joel Fernandes wrote: > Hi Vlad, > > On Sun, Sep 25, 2022 at 10:57:10AM +0200, Uladzislau Rezki wrote: > > > Implement timer-based RCU lazy callback batching. The batch is flushed > > > whenever a certain amount of time has passed, or the batch on a > > > particular CPU grows too big. Also memory pressure will flush it in a > > > future patch. > > > > > > To handle several corner cases automagically (such as rcu_barrier() and > > > hotplug), we re-use bypass lists to handle lazy CBs. The bypass list > > > length has the lazy CB length included in it. A separate lazy CB length > > > counter is also introduced to keep track of the number of lazy CBs. > > > > > > v5->v6: > > > > > > [ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other > > > deferral levels wake much earlier so for those it is not needed. ] > > > > > > [ Frederic Weisbec: Use flush flags to keep bypass API code clean. ] > > > > > > [ Frederic Weisbec: Make rcu_barrier() wake up only if main list empty. ] > > > > > > [ Frederic Weisbec: Remove extra 'else if' branch in rcu_nocb_try_bypass(). ] > > > > > > [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ] > > > > > > [ Paul/Thomas/Joel: Make call_rcu() default lazy so users don't mess up. ] > > > > > I think it make sense to add some data to the commit message > > illustrating what this patch does. > > Sure, will do! > > > From my side i gave a try of this patch on my setup. Some data: > > > > <snip> > > root@pc638:/home/urezki/rcu_v6# ./perf_script_parser ./perf_v6.script | sort -nk 6 | grep rcu > > name: rcuop/23 pid: 184 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/26 pid: 206 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/29 pid: 227 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/2 pid: 35 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/33 pid: 256 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/34 pid: 263 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/35 pid: 270 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/36 pid: 277 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/37 pid: 284 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/38 pid: 291 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/49 pid: 370 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/59 pid: 441 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/63 pid: 469 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuog/0 pid: 16 woken-up 2 interval: min 8034 max 8034 avg 4017 > > name: rcuog/24 pid: 191 woken-up 2 interval: min 7941 max 7941 avg 3970 > > name: rcuog/32 pid: 248 woken-up 2 interval: min 7542 max 7542 avg 3771 > > name: rcuog/48 pid: 362 woken-up 2 interval: min 8065 max 8065 avg 4032 > > name: rcuog/56 pid: 419 woken-up 2 interval: min 8076 max 8076 avg 4038 > > name: rcuop/21 pid: 170 woken-up 2 interval: min 13311438 max 13311438 avg 6655719 > > name: rcuog/16 pid: 134 woken-up 4 interval: min 8029 max 13303387 avg 3329863 > > name: rcuop/9 pid: 85 woken-up 4 interval: min 10007570 max 10007586 avg 7505684 > > name: rcuog/8 pid: 77 woken-up 8 interval: min 6240 max 10001242 avg 3753622 > > name: rcu_preempt pid: 15 woken-up 18 interval: min 6058 max 9999713 avg 2140788 > > name: test_rcu/0 pid: 1411 woken-up 10003 interval: min 165 max 19072 avg 4275 > > root@pc638:/home/urezki/rcu_v6# > > > > root@pc638:/home/urezki/rcu_v6# ./perf_script_parser ./perf_default.script | sort -nk 6 | grep rcu > > name: rcuop/33 pid: 256 woken-up 
1 interval: min 0 max 0 avg 0 > > name: rcuop/51 pid: 384 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuog/32 pid: 248 woken-up 2 interval: min 11927 max 11927 avg 5963 > > name: rcuop/63 pid: 469 woken-up 2 interval: min 23963 max 23963 avg 11981 > > name: rcuog/56 pid: 419 woken-up 3 interval: min 11132 max 23967 avg 11699 > > name: rcuop/50 pid: 377 woken-up 3 interval: min 8057 max 4944344 avg 1650800 > > name: rcuog/48 pid: 362 woken-up 8 interval: min 2712 max 37430015 avg 5298801 > > name: rcuop/16 pid: 135 woken-up 4790 interval: min 7340 max 16649 avg 8843 > > name: rcuog/16 pid: 134 woken-up 4792 interval: min 7368 max 16644 avg 8844 > > name: rcu_preempt pid: 15 woken-up 5302 interval: min 26 max 12179 avg 7994 > > name: test_rcu/0 pid: 1353 woken-up 10003 interval: min 169 max 18508 avg 4236 > > root@pc638:/home/urezki/rcu_v6# > > <snip> > > > > so it is obvious that the patch does the job. > > Thanks a lot for testing! > > > On my KVM machine the boot time is affected: > > > > <snip> > > [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > > [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > > [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > > [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > > [ 104.115418] process '/usr/bin/fstype' started with executable stack > > [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > > [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > > [ 104.340193] systemd[1]: Detected virtualization kvm. > > [ 104.340196] systemd[1]: Detected architecture x86-64. > > [ 104.359032] systemd[1]: Set hostname to <pc638>. > > [ 105.740109] random: crng init done > > [ 105.741267] systemd[1]: Reached target Remote File Systems. > > <snip> > > > > 2 - 11 and second delay is between 32 - 104. So there are still users which must > > be waiting for "RCU" in a sync way. > > I was wondering if you can compare boot logs and see which timestamp does the > slow down start from. That way, we can narrow down the callback. Also another > idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > ftrace_dump_on_oops" to the boot params, and then manually call > "tracing_off(); panic();" from the code at the first printk that seems off in > your comparison of good vs bad. For example, if "crng init done" timestamp is > off, put the "tracing_off(); panic();" there. Then grab the serial console > output to see what were the last callbacks that was queued/invoked. We do seem to be in need of some way to quickly and easily locate the callback that needed to be _flush() due to a wakeup. Might one more proactive approach be to use Coccinelle to locate such callback functions? We might not want -all- callbacks that do wakeups to use call_rcu_flush(), but knowing which are which should speed up slow-boot debugging by quite a bit. Or is there a better way to do this? 
Thanx, Paul > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > > index 08605ce7379d..40ae36904825 100644 > > > --- a/include/linux/rcupdate.h > > > +++ b/include/linux/rcupdate.h > > > @@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void) > > > > > > #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ > > > > > > +#ifdef CONFIG_RCU_LAZY > > > +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func); > > > +#else > > > +static inline void call_rcu_flush(struct rcu_head *head, > > > + rcu_callback_t func) { call_rcu(head, func); } > > > +#endif > > > + > > > /* Internal to kernel */ > > > void rcu_init(void); > > > extern int rcu_scheduler_active; > > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > > > index f53ad63b2bc6..edd632e68497 100644 > > > --- a/kernel/rcu/Kconfig > > > +++ b/kernel/rcu/Kconfig > > > @@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB > > > Say N here if you hate read-side memory barriers. > > > Take the default if you are unsure. > > > > > > +config RCU_LAZY > > > + bool "RCU callback lazy invocation functionality" > > > + depends on RCU_NOCB_CPU > > > + default n > > > + help > > > + To save power, batch RCU callbacks and flush after delay, memory > > > + pressure or callback list growing too big. > > > + > > > > > Do you think you need this kernel option? Can we just consider and make > > it a run-time configurable? For example much more users will give it a try, > > so it will increase a coverage. By default it can be off. > > > > Also you do not need to do: > > > > #ifdef LAZY > > How does the "LAZY" macro end up being runtime-configurable? That's static / > compile time. Did I miss something? > > > ... > > #else > > ... > > #endif > > > > > > > > +/* > > > + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that > > > + * can elapse before lazy callbacks are flushed. Lazy callbacks > > > + * could be flushed much earlier for a number of other reasons > > > + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are > > > + * left unsubmitted to RCU after those many jiffies. > > > + */ > > > +#define LAZY_FLUSH_JIFFIES (10 * HZ) > > > +static unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES; > > Make it configurable? I do not think you actually need 10 seconds here. > > Reducing it will reduce a possibility to hit a low memory condition. 1 > > second would be far enough i think. > > Hmm, I can make the delay configurable but for now I'll keep this as default > as all of our power testing has been done with that and I don't want risk > losing the optimization. > > Honestly, I am not worried too about memory pressure as we have a shrinker > which triggers flushes on the slightest hint of memory pressure. If it is not > handling it properly, then we need to fix the shrinker. > > thanks, > > - Joel >
On Mon, Sep 26, 2022 at 10:48:46AM -0700, Paul E. McKenney wrote: > On Sun, Sep 25, 2022 at 05:46:53PM +0000, Joel Fernandes wrote: > > Hi Vlad, > > > > On Sun, Sep 25, 2022 at 10:57:10AM +0200, Uladzislau Rezki wrote: > > > > Implement timer-based RCU lazy callback batching. The batch is flushed > > > > whenever a certain amount of time has passed, or the batch on a > > > > particular CPU grows too big. Also memory pressure will flush it in a > > > > future patch. > > > > > > > > To handle several corner cases automagically (such as rcu_barrier() and > > > > hotplug), we re-use bypass lists to handle lazy CBs. The bypass list > > > > length has the lazy CB length included in it. A separate lazy CB length > > > > counter is also introduced to keep track of the number of lazy CBs. > > > > > > > > v5->v6: > > > > > > > > [ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other > > > > deferral levels wake much earlier so for those it is not needed. ] > > > > > > > > [ Frederic Weisbec: Use flush flags to keep bypass API code clean. ] > > > > > > > > [ Frederic Weisbec: Make rcu_barrier() wake up only if main list empty. ] > > > > > > > > [ Frederic Weisbec: Remove extra 'else if' branch in rcu_nocb_try_bypass(). ] > > > > > > > > [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ] > > > > > > > > [ Paul/Thomas/Joel: Make call_rcu() default lazy so users don't mess up. ] > > > > > > > I think it make sense to add some data to the commit message > > > illustrating what this patch does. > > > > Sure, will do! > > > > > From my side i gave a try of this patch on my setup. Some data: > > > > > > <snip> > > > root@pc638:/home/urezki/rcu_v6# ./perf_script_parser ./perf_v6.script | sort -nk 6 | grep rcu > > > name: rcuop/23 pid: 184 woken-up 1 interval: min 0 max 0 avg 0 > > > name: rcuop/26 pid: 206 woken-up 1 interval: min 0 max 0 avg 0 > > > name: rcuop/29 pid: 227 woken-up 1 interval: min 0 max 0 avg 0 > > > name: rcuop/2 pid: 35 woken-up 1 interval: min 0 max 0 avg 0 > > > name: rcuop/33 pid: 256 woken-up 1 interval: min 0 max 0 avg 0 > > > name: rcuop/34 pid: 263 woken-up 1 interval: min 0 max 0 avg 0 > > > name: rcuop/35 pid: 270 woken-up 1 interval: min 0 max 0 avg 0 > > > name: rcuop/36 pid: 277 woken-up 1 interval: min 0 max 0 avg 0 > > > name: rcuop/37 pid: 284 woken-up 1 interval: min 0 max 0 avg 0 > > > name: rcuop/38 pid: 291 woken-up 1 interval: min 0 max 0 avg 0 > > > name: rcuop/49 pid: 370 woken-up 1 interval: min 0 max 0 avg 0 > > > name: rcuop/59 pid: 441 woken-up 1 interval: min 0 max 0 avg 0 > > > name: rcuop/63 pid: 469 woken-up 1 interval: min 0 max 0 avg 0 > > > name: rcuog/0 pid: 16 woken-up 2 interval: min 8034 max 8034 avg 4017 > > > name: rcuog/24 pid: 191 woken-up 2 interval: min 7941 max 7941 avg 3970 > > > name: rcuog/32 pid: 248 woken-up 2 interval: min 7542 max 7542 avg 3771 > > > name: rcuog/48 pid: 362 woken-up 2 interval: min 8065 max 8065 avg 4032 > > > name: rcuog/56 pid: 419 woken-up 2 interval: min 8076 max 8076 avg 4038 > > > name: rcuop/21 pid: 170 woken-up 2 interval: min 13311438 max 13311438 avg 6655719 > > > name: rcuog/16 pid: 134 woken-up 4 interval: min 8029 max 13303387 avg 3329863 > > > name: rcuop/9 pid: 85 woken-up 4 interval: min 10007570 max 10007586 avg 7505684 > > > name: rcuog/8 pid: 77 woken-up 8 interval: min 6240 max 10001242 avg 3753622 > > > name: rcu_preempt pid: 15 woken-up 18 interval: min 6058 max 9999713 avg 2140788 > > > name: test_rcu/0 pid: 1411 woken-up 10003 interval: min 165 max 19072 
avg 4275 > > > root@pc638:/home/urezki/rcu_v6# > > > > > > root@pc638:/home/urezki/rcu_v6# ./perf_script_parser ./perf_default.script | sort -nk 6 | grep rcu > > > name: rcuop/33 pid: 256 woken-up 1 interval: min 0 max 0 avg 0 > > > name: rcuop/51 pid: 384 woken-up 1 interval: min 0 max 0 avg 0 > > > name: rcuog/32 pid: 248 woken-up 2 interval: min 11927 max 11927 avg 5963 > > > name: rcuop/63 pid: 469 woken-up 2 interval: min 23963 max 23963 avg 11981 > > > name: rcuog/56 pid: 419 woken-up 3 interval: min 11132 max 23967 avg 11699 > > > name: rcuop/50 pid: 377 woken-up 3 interval: min 8057 max 4944344 avg 1650800 > > > name: rcuog/48 pid: 362 woken-up 8 interval: min 2712 max 37430015 avg 5298801 > > > name: rcuop/16 pid: 135 woken-up 4790 interval: min 7340 max 16649 avg 8843 > > > name: rcuog/16 pid: 134 woken-up 4792 interval: min 7368 max 16644 avg 8844 > > > name: rcu_preempt pid: 15 woken-up 5302 interval: min 26 max 12179 avg 7994 > > > name: test_rcu/0 pid: 1353 woken-up 10003 interval: min 169 max 18508 avg 4236 > > > root@pc638:/home/urezki/rcu_v6# > > > <snip> > > > > > > so it is obvious that the patch does the job. > > > > Thanks a lot for testing! > > > > > On my KVM machine the boot time is affected: > > > > > > <snip> > > > [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > > [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > > > [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > > > [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > > > [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > > > [ 104.115418] process '/usr/bin/fstype' started with executable stack > > > [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > > > [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > > > [ 104.340193] systemd[1]: Detected virtualization kvm. > > > [ 104.340196] systemd[1]: Detected architecture x86-64. > > > [ 104.359032] systemd[1]: Set hostname to <pc638>. > > > [ 105.740109] random: crng init done > > > [ 105.741267] systemd[1]: Reached target Remote File Systems. > > > <snip> > > > > > > 2 - 11 and second delay is between 32 - 104. So there are still users which must > > > be waiting for "RCU" in a sync way. > > > > I was wondering if you can compare boot logs and see which timestamp does the > > slow down start from. That way, we can narrow down the callback. Also another > > idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > > ftrace_dump_on_oops" to the boot params, and then manually call > > "tracing_off(); panic();" from the code at the first printk that seems off in > > your comparison of good vs bad. For example, if "crng init done" timestamp is > > off, put the "tracing_off(); panic();" there. Then grab the serial console > > output to see what were the last callbacks that was queued/invoked. > > We do seem to be in need of some way to quickly and easily locate the > callback that needed to be _flush() due to a wakeup. 
> <snip> diff --git a/kernel/workqueue.c b/kernel/workqueue.c index aeea9731ef80..fe1146d97f1a 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork) if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { rwork->wq = wq; - call_rcu(&rwork->rcu, rcu_work_rcufn); + call_rcu_flush(&rwork->rcu, rcu_work_rcufn); return true; } <snip> ? But it does not fully solve my boot-up issue. Will debug tomorrow further. > Might one more proactive approach be to use Coccinelle to locate such > callback functions? We might not want -all- callbacks that do wakeups > to use call_rcu_flush(), but knowing which are which should speed up > slow-boot debugging by quite a bit. > > Or is there a better way to do this? > I am not sure what Coccinelle is. If we had something automated that measures a boot time and if needed does some profiling it would be good. Otherwise it is a manual debugging mainly, IMHO. -- Uladzislau Rezki
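
For context on why queue_rcu_work() is a natural candidate: its RCU callback immediately queues work, which wakes a kworker, so anything flushing that work ends up waiting on the (now lazy) grace period. From memory, the callback looks roughly like this in kernel/workqueue.c (sketch only, check the actual tree):

static void rcu_work_rcufn(struct rcu_head *rcu)
{
	struct rcu_work *rwork = container_of(rcu, struct rcu_work, rcu);

	/* Queuing the work wakes a kworker, hence the call_rcu_flush() wish. */
	local_irq_disable();
	__queue_work(WORK_CPU_UNBOUND, rwork->wq, &rwork->work);
	local_irq_enable();
}
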
On Sun, Sep 25, 2022 at 05:46:53PM +0000, Joel Fernandes wrote: > Hi Vlad, > > On Sun, Sep 25, 2022 at 10:57:10AM +0200, Uladzislau Rezki wrote: > > > Implement timer-based RCU lazy callback batching. The batch is flushed > > > whenever a certain amount of time has passed, or the batch on a > > > particular CPU grows too big. Also memory pressure will flush it in a > > > future patch. > > > > > > To handle several corner cases automagically (such as rcu_barrier() and > > > hotplug), we re-use bypass lists to handle lazy CBs. The bypass list > > > length has the lazy CB length included in it. A separate lazy CB length > > > counter is also introduced to keep track of the number of lazy CBs. > > > > > > v5->v6: > > > > > > [ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other > > > deferral levels wake much earlier so for those it is not needed. ] > > > > > > [ Frederic Weisbec: Use flush flags to keep bypass API code clean. ] > > > > > > [ Frederic Weisbec: Make rcu_barrier() wake up only if main list empty. ] > > > > > > [ Frederic Weisbec: Remove extra 'else if' branch in rcu_nocb_try_bypass(). ] > > > > > > [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ] > > > > > > [ Paul/Thomas/Joel: Make call_rcu() default lazy so users don't mess up. ] > > > > > I think it make sense to add some data to the commit message > > illustrating what this patch does. > > Sure, will do! > > > From my side i gave a try of this patch on my setup. Some data: > > > > <snip> > > root@pc638:/home/urezki/rcu_v6# ./perf_script_parser ./perf_v6.script | sort -nk 6 | grep rcu > > name: rcuop/23 pid: 184 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/26 pid: 206 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/29 pid: 227 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/2 pid: 35 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/33 pid: 256 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/34 pid: 263 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/35 pid: 270 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/36 pid: 277 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/37 pid: 284 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/38 pid: 291 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/49 pid: 370 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/59 pid: 441 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuop/63 pid: 469 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuog/0 pid: 16 woken-up 2 interval: min 8034 max 8034 avg 4017 > > name: rcuog/24 pid: 191 woken-up 2 interval: min 7941 max 7941 avg 3970 > > name: rcuog/32 pid: 248 woken-up 2 interval: min 7542 max 7542 avg 3771 > > name: rcuog/48 pid: 362 woken-up 2 interval: min 8065 max 8065 avg 4032 > > name: rcuog/56 pid: 419 woken-up 2 interval: min 8076 max 8076 avg 4038 > > name: rcuop/21 pid: 170 woken-up 2 interval: min 13311438 max 13311438 avg 6655719 > > name: rcuog/16 pid: 134 woken-up 4 interval: min 8029 max 13303387 avg 3329863 > > name: rcuop/9 pid: 85 woken-up 4 interval: min 10007570 max 10007586 avg 7505684 > > name: rcuog/8 pid: 77 woken-up 8 interval: min 6240 max 10001242 avg 3753622 > > name: rcu_preempt pid: 15 woken-up 18 interval: min 6058 max 9999713 avg 2140788 > > name: test_rcu/0 pid: 1411 woken-up 10003 interval: min 165 max 19072 avg 4275 > > root@pc638:/home/urezki/rcu_v6# > > > > root@pc638:/home/urezki/rcu_v6# ./perf_script_parser ./perf_default.script | sort -nk 6 | grep rcu > > name: rcuop/33 pid: 256 woken-up 
1 interval: min 0 max 0 avg 0 > > name: rcuop/51 pid: 384 woken-up 1 interval: min 0 max 0 avg 0 > > name: rcuog/32 pid: 248 woken-up 2 interval: min 11927 max 11927 avg 5963 > > name: rcuop/63 pid: 469 woken-up 2 interval: min 23963 max 23963 avg 11981 > > name: rcuog/56 pid: 419 woken-up 3 interval: min 11132 max 23967 avg 11699 > > name: rcuop/50 pid: 377 woken-up 3 interval: min 8057 max 4944344 avg 1650800 > > name: rcuog/48 pid: 362 woken-up 8 interval: min 2712 max 37430015 avg 5298801 > > name: rcuop/16 pid: 135 woken-up 4790 interval: min 7340 max 16649 avg 8843 > > name: rcuog/16 pid: 134 woken-up 4792 interval: min 7368 max 16644 avg 8844 > > name: rcu_preempt pid: 15 woken-up 5302 interval: min 26 max 12179 avg 7994 > > name: test_rcu/0 pid: 1353 woken-up 10003 interval: min 169 max 18508 avg 4236 > > root@pc638:/home/urezki/rcu_v6# > > <snip> > > > > so it is obvious that the patch does the job. > > Thanks a lot for testing! > > > On my KVM machine the boot time is affected: > > > > <snip> > > [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > > [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > > [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > > [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > > [ 104.115418] process '/usr/bin/fstype' started with executable stack > > [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > > [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > > [ 104.340193] systemd[1]: Detected virtualization kvm. > > [ 104.340196] systemd[1]: Detected architecture x86-64. > > [ 104.359032] systemd[1]: Set hostname to <pc638>. > > [ 105.740109] random: crng init done > > [ 105.741267] systemd[1]: Reached target Remote File Systems. > > <snip> > > > > 2 - 11 and second delay is between 32 - 104. So there are still users which must > > be waiting for "RCU" in a sync way. > > I was wondering if you can compare boot logs and see which timestamp does the > slow down start from. That way, we can narrow down the callback. Also another > idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > ftrace_dump_on_oops" to the boot params, and then manually call > "tracing_off(); panic();" from the code at the first printk that seems off in > your comparison of good vs bad. For example, if "crng init done" timestamp is > off, put the "tracing_off(); panic();" there. Then grab the serial console > output to see what were the last callbacks that was queued/invoked. 
> > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > > index 08605ce7379d..40ae36904825 100644 > > > --- a/include/linux/rcupdate.h > > > +++ b/include/linux/rcupdate.h > > > @@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void) > > > > > > #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ > > > > > > +#ifdef CONFIG_RCU_LAZY > > > +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func); > > > +#else > > > +static inline void call_rcu_flush(struct rcu_head *head, > > > + rcu_callback_t func) { call_rcu(head, func); } > > > +#endif > > > + > > > /* Internal to kernel */ > > > void rcu_init(void); > > > extern int rcu_scheduler_active; > > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > > > index f53ad63b2bc6..edd632e68497 100644 > > > --- a/kernel/rcu/Kconfig > > > +++ b/kernel/rcu/Kconfig > > > @@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB > > > Say N here if you hate read-side memory barriers. > > > Take the default if you are unsure. > > > > > > +config RCU_LAZY > > > + bool "RCU callback lazy invocation functionality" > > > + depends on RCU_NOCB_CPU > > > + default n > > > + help > > > + To save power, batch RCU callbacks and flush after delay, memory > > > + pressure or callback list growing too big. > > > + > > > > > Do you think you need this kernel option? Can we just consider and make > > it a run-time configurable? For example much more users will give it a try, > > so it will increase a coverage. By default it can be off. > > > > Also you do not need to do: > > > > #ifdef LAZY > > How does the "LAZY" macro end up being runtime-configurable? That's static / > compile time. Did I miss something? > I am talking about removing if: config RCU_LAZY we might run into issues related to run-time switching though. > > ... > > #else > > ... > > #endif > > > > > > > > +/* > > > + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that > > > + * can elapse before lazy callbacks are flushed. Lazy callbacks > > > + * could be flushed much earlier for a number of other reasons > > > + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are > > > + * left unsubmitted to RCU after those many jiffies. > > > + */ > > > +#define LAZY_FLUSH_JIFFIES (10 * HZ) > > > +static unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES; > > Make it configurable? I do not think you actually need 10 seconds here. > > Reducing it will reduce a possibility to hit a low memory condition. 1 > > second would be far enough i think. > > Hmm, I can make the delay configurable but for now I'll keep this as default > as all of our power testing has been done with that and I don't want risk > losing the optimization. > Fine to me. Later on is OK. > > Honestly, I am not worried too about memory pressure as we have a shrinker > which triggers flushes on the slightest hint of memory pressure. If it is not > handling it properly, then we need to fix the shrinker. > Will not speculate here since i have not tested this patch enough with different time out and a low mem. condition. -- Uladzislau Rezki
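
If a runtime switch were ever preferred over the Kconfig option, a minimal sketch might look like the following (names assumed and untested; the patch as posted only has the compile-time CONFIG_RCU_LAZY, and it assumes the patch's internal entry point is something like __call_rcu_common(head, func, lazy)):

	/* kernel/rcu/tree.c -- hypothetical boot-time knob */
	static bool rcu_lazy_default = true;
	module_param(rcu_lazy_default, bool, 0444);

	void call_rcu(struct rcu_head *head, rcu_callback_t func)
	{
		__call_rcu_common(head, func, rcu_lazy_default);
	}
	EXPORT_SYMBOL_GPL(call_rcu);

As Uladzislau notes, flipping such a knob at run time with lazy CBs already queued would need more care than a boot-time-only parameter.
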
Hi Vlad, On Mon, Sep 26, 2022 at 09:39:23PM +0200, Uladzislau Rezki wrote: [...] > > > On my KVM machine the boot time is affected: > > > > > > <snip> > > > [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > > [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > > > [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > > > [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > > > [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > > > [ 104.115418] process '/usr/bin/fstype' started with executable stack > > > [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > > > [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > > > [ 104.340193] systemd[1]: Detected virtualization kvm. > > > [ 104.340196] systemd[1]: Detected architecture x86-64. > > > [ 104.359032] systemd[1]: Set hostname to <pc638>. > > > [ 105.740109] random: crng init done > > > [ 105.741267] systemd[1]: Reached target Remote File Systems. > > > <snip> > > > > > > 2 - 11 and second delay is between 32 - 104. So there are still users which must > > > be waiting for "RCU" in a sync way. > > > > I was wondering if you can compare boot logs and see which timestamp does the > > slow down start from. That way, we can narrow down the callback. Also another > > idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > > ftrace_dump_on_oops" to the boot params, and then manually call > > "tracing_off(); panic();" from the code at the first printk that seems off in > > your comparison of good vs bad. For example, if "crng init done" timestamp is > > off, put the "tracing_off(); panic();" there. Then grab the serial console > > output to see what were the last callbacks that was queued/invoked. Would you be willing to try these steps? Meanwhile I will try on my side as well with the .config you sent me in another email. > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > > > index 08605ce7379d..40ae36904825 100644 > > > > --- a/include/linux/rcupdate.h > > > > +++ b/include/linux/rcupdate.h > > > > @@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void) > > > > > > > > #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ > > > > > > > > +#ifdef CONFIG_RCU_LAZY > > > > +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func); > > > > +#else > > > > +static inline void call_rcu_flush(struct rcu_head *head, > > > > + rcu_callback_t func) { call_rcu(head, func); } > > > > +#endif > > > > + > > > > /* Internal to kernel */ > > > > void rcu_init(void); > > > > extern int rcu_scheduler_active; > > > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > > > > index f53ad63b2bc6..edd632e68497 100644 > > > > --- a/kernel/rcu/Kconfig > > > > +++ b/kernel/rcu/Kconfig > > > > @@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB > > > > Say N here if you hate read-side memory barriers. > > > > Take the default if you are unsure. > > > > > > > > +config RCU_LAZY > > > > + bool "RCU callback lazy invocation functionality" > > > > + depends on RCU_NOCB_CPU > > > > + default n > > > > + help > > > > + To save power, batch RCU callbacks and flush after delay, memory > > > > + pressure or callback list growing too big. > > > > + > > > > > > > Do you think you need this kernel option? 
Can we just consider and make > > > it a run-time configurable? For example much more users will give it a try, > > > so it will increase a coverage. By default it can be off. > > > > > > Also you do not need to do: > > > > > > #ifdef LAZY > > > > How does the "LAZY" macro end up being runtime-configurable? That's static / > > compile time. Did I miss something? > > > I am talking about removing if: > > config RCU_LAZY > > we might run into issues related to run-time switching though. When we started off, Paul said he wanted it kernel CONFIGurable. I will defer to Paul on a decision for that. I prefer kernel CONFIG so people don't forget to pass a boot param. thanks, - Joel
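
To make the recipe above concrete (the paths and the exact printk are only examples), the steps would be roughly:

1. Boot the slow kernel with:

   trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback ftrace_dump_on_oops

2. At the first printk whose timestamp looks late in the good-vs-bad comparison (say, wherever "crng init done" is printed), add a temporary trip point:

	pr_notice("crng init done\n");
	/* Debug hack only: stop tracing and dump the ring buffer via panic. */
	tracing_off();
	panic("dumping RCU callback trace");

3. Capture the serial console; the tail of the dump shows which callbacks were queued (rcu_callback) but not yet invoked (rcu_invoke_callback) across the stall.
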
On Mon, Sep 26, 2022 at 09:32:44PM +0200, Uladzislau Rezki wrote: [...] > > > > On my KVM machine the boot time is affected: > > > > > > > > <snip> > > > > [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > > > [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > > > > [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > > > > [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > > > > [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > > > > [ 104.115418] process '/usr/bin/fstype' started with executable stack > > > > [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > > > > [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > > > > [ 104.340193] systemd[1]: Detected virtualization kvm. > > > > [ 104.340196] systemd[1]: Detected architecture x86-64. > > > > [ 104.359032] systemd[1]: Set hostname to <pc638>. > > > > [ 105.740109] random: crng init done > > > > [ 105.741267] systemd[1]: Reached target Remote File Systems. > > > > <snip> > > > > > > > > 2 - 11 and second delay is between 32 - 104. So there are still users which must > > > > be waiting for "RCU" in a sync way. > > > > > > I was wondering if you can compare boot logs and see which timestamp does the > > > slow down start from. That way, we can narrow down the callback. Also another > > > idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > > > ftrace_dump_on_oops" to the boot params, and then manually call > > > "tracing_off(); panic();" from the code at the first printk that seems off in > > > your comparison of good vs bad. For example, if "crng init done" timestamp is > > > off, put the "tracing_off(); panic();" there. Then grab the serial console > > > output to see what were the last callbacks that was queued/invoked. > > > > We do seem to be in need of some way to quickly and easily locate the > > callback that needed to be _flush() due to a wakeup. > > > <snip> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > index aeea9731ef80..fe1146d97f1a 100644 > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork) > > if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { > rwork->wq = wq; > - call_rcu(&rwork->rcu, rcu_work_rcufn); > + call_rcu_flush(&rwork->rcu, rcu_work_rcufn); > return true; > } > > <snip> > > ? > > But it does not fully solve my boot-up issue. Will debug tomorrow further. Ah, but at least its progress, thanks. Could you send me a patch to include in the next revision with details of this? > > Might one more proactive approach be to use Coccinelle to locate such > > callback functions? We might not want -all- callbacks that do wakeups > > to use call_rcu_flush(), but knowing which are which should speed up > > slow-boot debugging by quite a bit. > > > > Or is there a better way to do this? > > > I am not sure what Coccinelle is. If we had something automated that measures > a boot time and if needed does some profiling it would be good. Otherwise it > is a manual debugging mainly, IMHO. Paul, What about using a default-off kernel CONFIG that splats on all lazy call_rcu() callbacks that do a wake up. We could use the trace hooks to do it in kernel I think. 
I can talk to Steve to get ideas on how to do that but I think it can be done purely from trace events (we might need a new trace_end_invoke_callback to fire after the callback is invoked). Thoughts? thanks, - Joel
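
For illustration, the new event mentioned above would amount to a second hook around callback invocation in rcu_do_batch(); roughly (the surrounding lines are paraphrased from kernel/rcu/tree.c, and trace_rcu_end_invoke_callback is hypothetical):

	trace_rcu_invoke_callback(rcu_state.name, rhp);

	f = rhp->func;
	WRITE_ONCE(rhp->func, (rcu_callback_t)0L);
	f(rhp);

	/*
	 * Hypothetical new event: fires once the callback has run, so a
	 * tracing tool can attribute any sched wakeups seen in between
	 * to this specific callback, and splat (or just report) when a
	 * lazily-queued callback turns out to do a wakeup.
	 */
	trace_rcu_end_invoke_callback(rcu_state.name, rhp);
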
Hi Paul, On Mon, Sep 26, 2022 at 10:42:40AM -0700, Paul E. McKenney wrote: [..] > > > > >> + WRITE_ONCE(rdp->lazy_len, 0); > > > > >> + } else { > > > > >> + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > > > > >> + WRITE_ONCE(rdp->lazy_len, 0); > > > > > > > > > > This WRITE_ONCE() can be dropped out of the "if" statement, correct? > > > > > > > > Yes will update. > > > > > > Thank you! > > > > > > > > If so, this could be an "if" statement with two statements in its "then" > > > > > clause, no "else" clause, and two statements following the "if" statement. > > > > > > > > I don’t think we can get rid of the else part but I’ll see what it looks like. > > > > > > In the function header, s/rhp/rhp_in/, then: > > > > > > struct rcu_head *rhp = rhp_in; > > > > > > And then: > > > > > > if (lazy && rhp) { > > > rcu_cblist_enqueue(&rdp->nocb_bypass, rhp); > > > rhp = NULL; > > > > This enqueues on to the bypass list, where as if lazy && rhp, I want to queue > > the new rhp on to the main cblist. So the pseudo code in my patch is: > > > > if (lazy and rhp) then > > 1. flush bypass CBs on to main list. > > 2. queue new CB on to main list. > > And the difference is here, correct? I enqueue to the bypass list, > which is then flushed (in order) to the main list. In contrast, you > flush the bypass list, then enqueue to the main list. Either way, > the callback referenced by rhp ends up at the end of ->cblist. > > Or am I on the wrong branch of this "if" statement? But we have to flush first, and then queue the new one. Otherwise wouldn't the callbacks be invoked out of order? Or did I miss something? > > else > > 1. flush bypass CBs on to main list > > 2. queue new CB on to bypass list. > > > > > } > > > rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > > > WRITE_ONCE(rdp->lazy_len, 0); > > > > > > Or did I mess something up? > > > > So the rcu_cblist_flush_enqueue() has to happen before the > > rcu_cblist_enqueue() to preserve the ordering of flushing into the main list, > > and queuing on to the main list for the "if". Where as in your snip, the > > order is reversed. > > Did I pick the correct branch of the "if" statement above? Or were you > instead talking about the "else" clause? > > I would have been more worried about getting cblist->len right. Hmm, I think my concern was more the ordering of callbacks, and moving the write to length should be Ok. > > If I consolidate it then, it looks like the following. However, it is a bit > > more unreadable. I could instead just take the WRITE_ONCE out of both if/else > > and move it to after the if/else, that would be cleanest. Does that sound > > good to you? Thanks! > > Let's first figure out whether or not we are talking past one another. ;-) Haha yeah :-) thanks, - Joel > > Thanx, Paul > > > ---8<----------------------- > > > > diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h > > index 1a182b9c4f6c..bd3f54d314e8 100644 > > --- a/kernel/rcu/tree_nocb.h > > +++ b/kernel/rcu/tree_nocb.h > > @@ -327,10 +327,11 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, > > * > > * Note that this function always returns true if rhp is NULL. 
> > */ > > -static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > > +static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp_in, > > unsigned long j, unsigned long flush_flags) > > { > > struct rcu_cblist rcl; > > + struct rcu_head *rhp = rhp_in; > > bool lazy = flush_flags & FLUSH_BP_LAZY; > > > > WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp)); > > @@ -348,14 +349,13 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, > > * If the new CB requested was a lazy one, queue it onto the main > > * ->cblist so we can take advantage of a sooner grade period. > > */ > > - if (lazy && rhp) { > > - rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL); > > - rcu_cblist_enqueue(&rcl, rhp); > > - WRITE_ONCE(rdp->lazy_len, 0); > > - } else { > > - rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > > - WRITE_ONCE(rdp->lazy_len, 0); > > - } > > + if (lazy && rhp) > > + rhp = NULL; > > + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > > + if (lazy && rhp_in) > > + rcu_cblist_enqueue(&rcl, rhp_in); > > + > > + WRITE_ONCE(rdp->lazy_len, 0); > > > > rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl); > > WRITE_ONCE(rdp->nocb_bypass_first, j);
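
Since the disagreement is only about ordering, here is a standalone toy program (not kernel code; plain tail-enqueued lists standing in for rcu_cblist) showing that both sequences leave the new callback at the tail of the main list:

#include <stdio.h>

struct cb { int id; struct cb *next; };
struct list { struct cb *head, **tail; };

static void list_init(struct list *l) { l->head = NULL; l->tail = &l->head; }

static void enqueue(struct list *l, struct cb *c)
{
	c->next = NULL;
	*l->tail = c;
	l->tail = &c->next;
}

/* Move every element of @from onto the tail of @to, preserving order. */
static void flush_onto(struct list *to, struct list *from)
{
	struct cb *c;

	while ((c = from->head) != NULL) {
		from->head = c->next;
		enqueue(to, c);
	}
	from->tail = &from->head;
}

static void print(const char *what, struct list *l)
{
	printf("%s:", what);
	for (struct cb *c = l->head; c; c = c->next)
		printf(" %d", c->id);
	printf("\n");
}

int main(void)
{
	struct cb a = { 1 }, b = { 2 }, n1 = { 3 };
	struct cb a2 = { 1 }, b2 = { 2 }, n2 = { 3 };
	struct list bypass, main_list;

	/* Variant A: flush the bypass onto the main list, then enqueue the new CB there. */
	list_init(&bypass); list_init(&main_list);
	enqueue(&bypass, &a); enqueue(&bypass, &b);
	flush_onto(&main_list, &bypass);
	enqueue(&main_list, &n1);
	print("flush-then-enqueue", &main_list);

	/* Variant B: enqueue the new CB onto the bypass first, then flush everything. */
	list_init(&bypass); list_init(&main_list);
	enqueue(&bypass, &a2); enqueue(&bypass, &b2);
	enqueue(&bypass, &n2);
	flush_onto(&main_list, &bypass);
	print("enqueue-then-flush", &main_list);

	return 0;
}

Both variants print "1 2 3": enqueueing the new head at the tail of the bypass before an in-order flush is equivalent to flushing first and then enqueueing it on the main list, which is the point being made about getting rhp to the end of ->cblist either way.
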
On Mon, Sep 26, 2022 at 09:32:44PM +0200, Uladzislau Rezki wrote: > On Mon, Sep 26, 2022 at 10:48:46AM -0700, Paul E. McKenney wrote: > > On Sun, Sep 25, 2022 at 05:46:53PM +0000, Joel Fernandes wrote: > > > Hi Vlad, > > > > > > On Sun, Sep 25, 2022 at 10:57:10AM +0200, Uladzislau Rezki wrote: > > > > > Implement timer-based RCU lazy callback batching. The batch is flushed > > > > > whenever a certain amount of time has passed, or the batch on a > > > > > particular CPU grows too big. Also memory pressure will flush it in a > > > > > future patch. > > > > > > > > > > To handle several corner cases automagically (such as rcu_barrier() and > > > > > hotplug), we re-use bypass lists to handle lazy CBs. The bypass list > > > > > length has the lazy CB length included in it. A separate lazy CB length > > > > > counter is also introduced to keep track of the number of lazy CBs. > > > > > > > > > > v5->v6: > > > > > > > > > > [ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other > > > > > deferral levels wake much earlier so for those it is not needed. ] > > > > > > > > > > [ Frederic Weisbec: Use flush flags to keep bypass API code clean. ] > > > > > > > > > > [ Frederic Weisbec: Make rcu_barrier() wake up only if main list empty. ] > > > > > > > > > > [ Frederic Weisbec: Remove extra 'else if' branch in rcu_nocb_try_bypass(). ] > > > > > > > > > > [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ] > > > > > > > > > > [ Paul/Thomas/Joel: Make call_rcu() default lazy so users don't mess up. ] > > > > > > > > > I think it make sense to add some data to the commit message > > > > illustrating what this patch does. > > > > > > Sure, will do! > > > > > > > From my side i gave a try of this patch on my setup. Some data: > > > > > > > > <snip> > > > > root@pc638:/home/urezki/rcu_v6# ./perf_script_parser ./perf_v6.script | sort -nk 6 | grep rcu > > > > name: rcuop/23 pid: 184 woken-up 1 interval: min 0 max 0 avg 0 > > > > name: rcuop/26 pid: 206 woken-up 1 interval: min 0 max 0 avg 0 > > > > name: rcuop/29 pid: 227 woken-up 1 interval: min 0 max 0 avg 0 > > > > name: rcuop/2 pid: 35 woken-up 1 interval: min 0 max 0 avg 0 > > > > name: rcuop/33 pid: 256 woken-up 1 interval: min 0 max 0 avg 0 > > > > name: rcuop/34 pid: 263 woken-up 1 interval: min 0 max 0 avg 0 > > > > name: rcuop/35 pid: 270 woken-up 1 interval: min 0 max 0 avg 0 > > > > name: rcuop/36 pid: 277 woken-up 1 interval: min 0 max 0 avg 0 > > > > name: rcuop/37 pid: 284 woken-up 1 interval: min 0 max 0 avg 0 > > > > name: rcuop/38 pid: 291 woken-up 1 interval: min 0 max 0 avg 0 > > > > name: rcuop/49 pid: 370 woken-up 1 interval: min 0 max 0 avg 0 > > > > name: rcuop/59 pid: 441 woken-up 1 interval: min 0 max 0 avg 0 > > > > name: rcuop/63 pid: 469 woken-up 1 interval: min 0 max 0 avg 0 > > > > name: rcuog/0 pid: 16 woken-up 2 interval: min 8034 max 8034 avg 4017 > > > > name: rcuog/24 pid: 191 woken-up 2 interval: min 7941 max 7941 avg 3970 > > > > name: rcuog/32 pid: 248 woken-up 2 interval: min 7542 max 7542 avg 3771 > > > > name: rcuog/48 pid: 362 woken-up 2 interval: min 8065 max 8065 avg 4032 > > > > name: rcuog/56 pid: 419 woken-up 2 interval: min 8076 max 8076 avg 4038 > > > > name: rcuop/21 pid: 170 woken-up 2 interval: min 13311438 max 13311438 avg 6655719 > > > > name: rcuog/16 pid: 134 woken-up 4 interval: min 8029 max 13303387 avg 3329863 > > > > name: rcuop/9 pid: 85 woken-up 4 interval: min 10007570 max 10007586 avg 7505684 > > > > name: rcuog/8 pid: 77 woken-up 8 interval: min 6240 
max 10001242 avg 3753622 > > > > name: rcu_preempt pid: 15 woken-up 18 interval: min 6058 max 9999713 avg 2140788 > > > > name: test_rcu/0 pid: 1411 woken-up 10003 interval: min 165 max 19072 avg 4275 > > > > root@pc638:/home/urezki/rcu_v6# > > > > > > > > root@pc638:/home/urezki/rcu_v6# ./perf_script_parser ./perf_default.script | sort -nk 6 | grep rcu > > > > name: rcuop/33 pid: 256 woken-up 1 interval: min 0 max 0 avg 0 > > > > name: rcuop/51 pid: 384 woken-up 1 interval: min 0 max 0 avg 0 > > > > name: rcuog/32 pid: 248 woken-up 2 interval: min 11927 max 11927 avg 5963 > > > > name: rcuop/63 pid: 469 woken-up 2 interval: min 23963 max 23963 avg 11981 > > > > name: rcuog/56 pid: 419 woken-up 3 interval: min 11132 max 23967 avg 11699 > > > > name: rcuop/50 pid: 377 woken-up 3 interval: min 8057 max 4944344 avg 1650800 > > > > name: rcuog/48 pid: 362 woken-up 8 interval: min 2712 max 37430015 avg 5298801 > > > > name: rcuop/16 pid: 135 woken-up 4790 interval: min 7340 max 16649 avg 8843 > > > > name: rcuog/16 pid: 134 woken-up 4792 interval: min 7368 max 16644 avg 8844 > > > > name: rcu_preempt pid: 15 woken-up 5302 interval: min 26 max 12179 avg 7994 > > > > name: test_rcu/0 pid: 1353 woken-up 10003 interval: min 169 max 18508 avg 4236 > > > > root@pc638:/home/urezki/rcu_v6# > > > > <snip> > > > > > > > > so it is obvious that the patch does the job. > > > > > > Thanks a lot for testing! > > > > > > > On my KVM machine the boot time is affected: > > > > > > > > <snip> > > > > [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > > > [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > > > > [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > > > > [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > > > > [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > > > > [ 104.115418] process '/usr/bin/fstype' started with executable stack > > > > [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > > > > [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > > > > [ 104.340193] systemd[1]: Detected virtualization kvm. > > > > [ 104.340196] systemd[1]: Detected architecture x86-64. > > > > [ 104.359032] systemd[1]: Set hostname to <pc638>. > > > > [ 105.740109] random: crng init done > > > > [ 105.741267] systemd[1]: Reached target Remote File Systems. > > > > <snip> > > > > > > > > 2 - 11 and second delay is between 32 - 104. So there are still users which must > > > > be waiting for "RCU" in a sync way. > > > > > > I was wondering if you can compare boot logs and see which timestamp does the > > > slow down start from. That way, we can narrow down the callback. Also another > > > idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > > > ftrace_dump_on_oops" to the boot params, and then manually call > > > "tracing_off(); panic();" from the code at the first printk that seems off in > > > your comparison of good vs bad. For example, if "crng init done" timestamp is > > > off, put the "tracing_off(); panic();" there. Then grab the serial console > > > output to see what were the last callbacks that was queued/invoked. > > > > We do seem to be in need of some way to quickly and easily locate the > > callback that needed to be _flush() due to a wakeup. 
> > > <snip> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > index aeea9731ef80..fe1146d97f1a 100644 > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork) > > if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { > rwork->wq = wq; > - call_rcu(&rwork->rcu, rcu_work_rcufn); > + call_rcu_flush(&rwork->rcu, rcu_work_rcufn); > return true; > } > > <snip> > > ? This one does seem like a good candidate. Someday there might need to be a queue_rcu_work_flush() vs. queue_rcu_work(), but that needs to be proven. > But it does not fully solve my boot-up issue. Will debug tomorrow further. Sounds good! > > Might one more proactive approach be to use Coccinelle to locate such > > callback functions? We might not want -all- callbacks that do wakeups > > to use call_rcu_flush(), but knowing which are which should speed up > > slow-boot debugging by quite a bit. > > > > Or is there a better way to do this? > > > I am not sure what Coccinelle is. If we had something automated that measures > a boot time and if needed does some profiling it would be good. Otherwise it > is a manual debugging mainly, IMHO. Coccinelle is sort of like a variant of the "sed" command that understands C syntax. It is useful for searching for patterns in Linux-kernel source code. But if you are able to easily find the call_rcu() invocations that are slowing things down, maybe it is not needed. For now, anyway. Thanx, Paul
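For illustration of the class of callbacks being hunted above (the kind hiding behind queue_rcu_work()), here is a hypothetical example of an RCU callback whose whole job is to wake a sleeping task. The foo_* names are invented; only call_rcu_flush() comes from this series. Queuing such a callback lazily would stall the waiter for the full lazy-flush delay rather than a single grace period, which is why these call sites are the call_rcu_flush() candidates.

<snip>
#include <linux/completion.h>
#include <linux/rcupdate.h>

/* Hypothetical illustration; the foo_* names are made up. */
struct foo_req {
	struct rcu_head rh;
	struct completion done;
};

/* An RCU callback whose whole job is to perform a wakeup. */
static void foo_rcu_cb(struct rcu_head *rh)
{
	struct foo_req *req = container_of(rh, struct foo_req, rh);

	complete(&req->done);		/* wakes the task sleeping below */
}

/* Synchronously wait for a grace period by sleeping on the callback. */
static void foo_wait_for_gp(struct foo_req *req)
{
	init_completion(&req->done);
	/* A lazy call_rcu() here could stall the waiter for the lazy-flush delay. */
	call_rcu_flush(&req->rh, foo_rcu_cb);
	wait_for_completion(&req->done);
}
<snip>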
On Mon, Sep 26, 2022 at 09:02:21PM +0000, Joel Fernandes wrote: > On Mon, Sep 26, 2022 at 09:32:44PM +0200, Uladzislau Rezki wrote: > [...] > > > > > On my KVM machine the boot time is affected: > > > > > > > > > > <snip> > > > > > [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > > > > [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > > > > > [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > > > > > [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > > > > > [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > > > > > [ 104.115418] process '/usr/bin/fstype' started with executable stack > > > > > [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > > > > > [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > > > > > [ 104.340193] systemd[1]: Detected virtualization kvm. > > > > > [ 104.340196] systemd[1]: Detected architecture x86-64. > > > > > [ 104.359032] systemd[1]: Set hostname to <pc638>. > > > > > [ 105.740109] random: crng init done > > > > > [ 105.741267] systemd[1]: Reached target Remote File Systems. > > > > > <snip> > > > > > > > > > > 2 - 11 and second delay is between 32 - 104. So there are still users which must > > > > > be waiting for "RCU" in a sync way. > > > > > > > > I was wondering if you can compare boot logs and see which timestamp does the > > > > slow down start from. That way, we can narrow down the callback. Also another > > > > idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > > > > ftrace_dump_on_oops" to the boot params, and then manually call > > > > "tracing_off(); panic();" from the code at the first printk that seems off in > > > > your comparison of good vs bad. For example, if "crng init done" timestamp is > > > > off, put the "tracing_off(); panic();" there. Then grab the serial console > > > > output to see what were the last callbacks that was queued/invoked. > > > > > > We do seem to be in need of some way to quickly and easily locate the > > > callback that needed to be _flush() due to a wakeup. > > > > > <snip> > > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > > index aeea9731ef80..fe1146d97f1a 100644 > > --- a/kernel/workqueue.c > > +++ b/kernel/workqueue.c > > @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork) > > > > if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { > > rwork->wq = wq; > > - call_rcu(&rwork->rcu, rcu_work_rcufn); > > + call_rcu_flush(&rwork->rcu, rcu_work_rcufn); > > return true; > > } > > > > <snip> > > > > ? > > > > But it does not fully solve my boot-up issue. Will debug tomorrow further. > > Ah, but at least its progress, thanks. Could you send me a patch to include > in the next revision with details of this? > > > > Might one more proactive approach be to use Coccinelle to locate such > > > callback functions? We might not want -all- callbacks that do wakeups > > > to use call_rcu_flush(), but knowing which are which should speed up > > > slow-boot debugging by quite a bit. > > > > > > Or is there a better way to do this? > > > > > I am not sure what Coccinelle is. If we had something automated that measures > > a boot time and if needed does some profiling it would be good. Otherwise it > > is a manual debugging mainly, IMHO. 
> > Paul, What about using a default-off kernel CONFIG that splats on all lazy > call_rcu() callbacks that do a wake up. We could use the trace hooks to do it > in kernel I think. I can talk to Steve to get ideas on how to do that but I > think it can be done purely from trace events (we might need a new > trace_end_invoke_callback to fire after the callback is invoked). Thoughts? Could you look for wakeups invoked between trace_rcu_batch_start() and trace_rcu_batch_end() that are not from interrupt context? This would of course need to be associated with a task rather than a CPU. Note that you would need to check for wakeups from interrupt handlers even with the extra trace_end_invoke_callback(). The window where an interrupt handler could do a wakeup would be reduced, but not eliminated. Thanx, Paul
On Mon, Sep 26, 2022 at 08:54:27PM +0000, Joel Fernandes wrote: > Hi Vlad, > > On Mon, Sep 26, 2022 at 09:39:23PM +0200, Uladzislau Rezki wrote: > [...] > > > > On my KVM machine the boot time is affected: > > > > > > > > <snip> > > > > [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > > > [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > > > > [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > > > > [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > > > > [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > > > > [ 104.115418] process '/usr/bin/fstype' started with executable stack > > > > [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > > > > [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > > > > [ 104.340193] systemd[1]: Detected virtualization kvm. > > > > [ 104.340196] systemd[1]: Detected architecture x86-64. > > > > [ 104.359032] systemd[1]: Set hostname to <pc638>. > > > > [ 105.740109] random: crng init done > > > > [ 105.741267] systemd[1]: Reached target Remote File Systems. > > > > <snip> > > > > > > > > 2 - 11 and second delay is between 32 - 104. So there are still users which must > > > > be waiting for "RCU" in a sync way. > > > > > > I was wondering if you can compare boot logs and see which timestamp does the > > > slow down start from. That way, we can narrow down the callback. Also another > > > idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > > > ftrace_dump_on_oops" to the boot params, and then manually call > > > "tracing_off(); panic();" from the code at the first printk that seems off in > > > your comparison of good vs bad. For example, if "crng init done" timestamp is > > > off, put the "tracing_off(); panic();" there. Then grab the serial console > > > output to see what were the last callbacks that was queued/invoked. > > Would you be willing to try these steps? Meanwhile I will try on my side as > well with the .config you sent me in another email. > > > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > > > > index 08605ce7379d..40ae36904825 100644 > > > > > --- a/include/linux/rcupdate.h > > > > > +++ b/include/linux/rcupdate.h > > > > > @@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void) > > > > > > > > > > #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ > > > > > > > > > > +#ifdef CONFIG_RCU_LAZY > > > > > +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func); > > > > > +#else > > > > > +static inline void call_rcu_flush(struct rcu_head *head, > > > > > + rcu_callback_t func) { call_rcu(head, func); } > > > > > +#endif > > > > > + > > > > > /* Internal to kernel */ > > > > > void rcu_init(void); > > > > > extern int rcu_scheduler_active; > > > > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > > > > > index f53ad63b2bc6..edd632e68497 100644 > > > > > --- a/kernel/rcu/Kconfig > > > > > +++ b/kernel/rcu/Kconfig > > > > > @@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB > > > > > Say N here if you hate read-side memory barriers. > > > > > Take the default if you are unsure. 
> > > > > > > > > > +config RCU_LAZY > > > > > + bool "RCU callback lazy invocation functionality" > > > > > + depends on RCU_NOCB_CPU > > > > > + default n > > > > > + help > > > > > + To save power, batch RCU callbacks and flush after delay, memory > > > > > + pressure or callback list growing too big. > > > > > + > > > > > > > > > Do you think you need this kernel option? Can we just consider and make > > > > it a run-time configurable? For example much more users will give it a try, > > > > so it will increase a coverage. By default it can be off. > > > > > > > > Also you do not need to do: > > > > > > > > #ifdef LAZY > > > > > > How does the "LAZY" macro end up being runtime-configurable? That's static / > > > compile time. Did I miss something? > > > > > I am talking about removing if: > > > > config RCU_LAZY > > > > we might run into issues related to run-time switching though. > > When we started off, Paul said he wanted it kernel CONFIGurable. I will defer > to Paul on a decision for that. I prefer kernel CONFIG so people don't forget > to pass a boot param. I am fine with a kernel boot parameter for this one. You guys were the ones preferring Kconfig options. ;-) But in that case, the CONFIG_RCU_NOCB_CPU would come into play to handle the case where there is no bypass. Thanx, Paul
On Mon, Sep 26, 2022 at 09:07:12PM +0000, Joel Fernandes wrote: > Hi Paul, > > On Mon, Sep 26, 2022 at 10:42:40AM -0700, Paul E. McKenney wrote: > [..] > > > > > >> + WRITE_ONCE(rdp->lazy_len, 0); > > > > > >> + } else { > > > > > >> + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > > > > > >> + WRITE_ONCE(rdp->lazy_len, 0); > > > > > > > > > > > > This WRITE_ONCE() can be dropped out of the "if" statement, correct? > > > > > > > > > > Yes will update. > > > > > > > > Thank you! > > > > > > > > > > If so, this could be an "if" statement with two statements in its "then" > > > > > > clause, no "else" clause, and two statements following the "if" statement. > > > > > > > > > > I don’t think we can get rid of the else part but I’ll see what it looks like. > > > > > > > > In the function header, s/rhp/rhp_in/, then: > > > > > > > > struct rcu_head *rhp = rhp_in; > > > > > > > > And then: > > > > > > > > if (lazy && rhp) { > > > > rcu_cblist_enqueue(&rdp->nocb_bypass, rhp); > > > > rhp = NULL; > > > > > > This enqueues on to the bypass list, where as if lazy && rhp, I want to queue > > > the new rhp on to the main cblist. So the pseudo code in my patch is: > > > > > > if (lazy and rhp) then > > > 1. flush bypass CBs on to main list. > > > 2. queue new CB on to main list. > > > > And the difference is here, correct? I enqueue to the bypass list, > > which is then flushed (in order) to the main list. In contrast, you > > flush the bypass list, then enqueue to the main list. Either way, > > the callback referenced by rhp ends up at the end of ->cblist. > > > > Or am I on the wrong branch of this "if" statement? > > But we have to flush first, and then queue the new one. Otherwise wouldn't > the callbacks be invoked out of order? Or did I miss something? I don't think so... We want the new callback to be last, right? One way to do that is to flush the bypass, then queue the new callback onto ->cblist. Another way to do that is to enqueue the new callback onto the end of the bypass, then flush the bypass. Why wouldn't these result in the same order? > > > else > > > 1. flush bypass CBs on to main list > > > 2. queue new CB on to bypass list. > > > > > > > } > > > > rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > > > > WRITE_ONCE(rdp->lazy_len, 0); > > > > > > > > Or did I mess something up? > > > > > > So the rcu_cblist_flush_enqueue() has to happen before the > > > rcu_cblist_enqueue() to preserve the ordering of flushing into the main list, > > > and queuing on to the main list for the "if". Where as in your snip, the > > > order is reversed. > > > > Did I pick the correct branch of the "if" statement above? Or were you > > instead talking about the "else" clause? > > > > I would have been more worried about getting cblist->len right. > > Hmm, I think my concern was more the ordering of callbacks, and moving the > write to length should be Ok. OK, sounds good to me! ;-) > > > If I consolidate it then, it looks like the following. However, it is a bit > > > more unreadable. I could instead just take the WRITE_ONCE out of both if/else > > > and move it to after the if/else, that would be cleanest. Does that sound > > > good to you? Thanks! > > > > Let's first figure out whether or not we are talking past one another. ;-) > > Haha yeah :-) So were we? ;-) Thanx, Paul
> On Sep 26, 2022, at 6:37 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > On Mon, Sep 26, 2022 at 09:07:12PM +0000, Joel Fernandes wrote: >> Hi Paul, >> >> On Mon, Sep 26, 2022 at 10:42:40AM -0700, Paul E. McKenney wrote: >> [..] >>>>>>>> + WRITE_ONCE(rdp->lazy_len, 0); >>>>>>>> + } else { >>>>>>>> + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); >>>>>>>> + WRITE_ONCE(rdp->lazy_len, 0); >>>>>>> >>>>>>> This WRITE_ONCE() can be dropped out of the "if" statement, correct? >>>>>> >>>>>> Yes will update. >>>>> >>>>> Thank you! >>>>> >>>>>>> If so, this could be an "if" statement with two statements in its "then" >>>>>>> clause, no "else" clause, and two statements following the "if" statement. >>>>>> >>>>>> I don’t think we can get rid of the else part but I’ll see what it looks like. >>>>> >>>>> In the function header, s/rhp/rhp_in/, then: >>>>> >>>>> struct rcu_head *rhp = rhp_in; >>>>> >>>>> And then: >>>>> >>>>> if (lazy && rhp) { >>>>> rcu_cblist_enqueue(&rdp->nocb_bypass, rhp); >>>>> rhp = NULL; >>>> >>>> This enqueues on to the bypass list, where as if lazy && rhp, I want to queue >>>> the new rhp on to the main cblist. So the pseudo code in my patch is: >>>> >>>> if (lazy and rhp) then >>>> 1. flush bypass CBs on to main list. >>>> 2. queue new CB on to main list. >>> >>> And the difference is here, correct? I enqueue to the bypass list, >>> which is then flushed (in order) to the main list. In contrast, you >>> flush the bypass list, then enqueue to the main list. Either way, >>> the callback referenced by rhp ends up at the end of ->cblist. >>> >>> Or am I on the wrong branch of this "if" statement? >> >> But we have to flush first, and then queue the new one. Otherwise wouldn't >> the callbacks be invoked out of order? Or did I miss something? > > I don't think so... > > We want the new callback to be last, right? One way to do that is to > flush the bypass, then queue the new callback onto ->cblist. Another way > to do that is to enqueue the new callback onto the end of the bypass, > then flush the bypass. Why wouldn't these result in the same order? Yes you are right, sorry. I was fixated on the main list. Both your snippet and my patch will be equivalent then. However I find your snippet a bit confusing, as in it is not immediately obvious - why would we queue something on to a list, if we were about to flush it. But any way, it does make it a clever piece of code in some sense and I am ok with doing it this way ;-) Thanks, - Joel > >>>> else >>>> 1. flush bypass CBs on to main list >>>> 2. queue new CB on to bypass list. >>>> >>>>> } >>>>> rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); >>>>> WRITE_ONCE(rdp->lazy_len, 0); >>>>> >>>>> Or did I mess something up? >>>> >>>> So the rcu_cblist_flush_enqueue() has to happen before the >>>> rcu_cblist_enqueue() to preserve the ordering of flushing into the main list, >>>> and queuing on to the main list for the "if". Where as in your snip, the >>>> order is reversed. >>> >>> Did I pick the correct branch of the "if" statement above? Or were you >>> instead talking about the "else" clause? >>> >>> I would have been more worried about getting cblist->len right. >> >> Hmm, I think my concern was more the ordering of callbacks, and moving the >> write to length should be Ok. > > OK, sounds good to me! ;-) > >>>> If I consolidate it then, it looks like the following. However, it is a bit >>>> more unreadable. 
I could instead just take the WRITE_ONCE out of both if/else >>>> and move it to after the if/else, that would be cleanest. Does that sound >>>> good to you? Thanks! >>> >>> Let's first figure out whether or not we are talking past one another. ;-) >> >> Haha yeah :-) > > So were we? ;-) > > Thanx, Paul
> On Sep 26, 2022, at 1:33 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > On Mon, Sep 26, 2022 at 03:04:38PM +0000, Joel Fernandes wrote: >>> On Mon, Sep 26, 2022 at 12:00:45AM +0200, Frederic Weisbecker wrote: >>> On Sat, Sep 24, 2022 at 09:00:39PM -0400, Joel Fernandes wrote: >>>> >>>> >>>>> On Sep 24, 2022, at 7:28 PM, Joel Fernandes <joel@joelfernandes.org> wrote: >>>>> >>>>> Hi Frederic, thanks for the response, replies >>>>> below courtesy fruit company’s device: >>>>> >>>>>>> On Sep 24, 2022, at 6:46 PM, Frederic Weisbecker <frederic@kernel.org> wrote: >>>>>>> >>>>>>> On Thu, Sep 22, 2022 at 10:01:01PM +0000, Joel Fernandes (Google) wrote: >>>>>>> @@ -3902,7 +3939,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) >>>>>>> rdp->barrier_head.func = rcu_barrier_callback; >>>>>>> debug_rcu_head_queue(&rdp->barrier_head); >>>>>>> rcu_nocb_lock(rdp); >>>>>>> - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); >>>>>>> + /* >>>>>>> + * Flush the bypass list, but also wake up the GP thread as otherwise >>>>>>> + * bypass/lazy CBs maynot be noticed, and can cause real long delays! >>>>>>> + */ >>>>>>> + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE)); >>>>>> >>>>>> This fixes an issue that goes beyond lazy implementation. It should be done >>>>>> in a separate patch, handling rcu_segcblist_entrain() as well, with "Fixes: " tag. >>>>> >>>>> I wanted to do that, however on discussion with >>>>> Paul I thought of making this optimization only for >>>>> all lazy bypass CBs. That makes it directly related >>>>> this patch since the laziness notion is first >>>>> introduced here. On the other hand I could make >>>>> this change in a later patch since we are not >>>>> super bisectable anyway courtesy of the last >>>>> patch (which is not really an issue if the CONFIG >>>>> is kept off during someone’s bisection. >>>> >>>> Or are we saying it’s worth doing the wake up for rcu barrier even for >>>> regular bypass CB? That’d save 2 jiffies on rcu barrier. If we agree it’s >>>> needed, then yes splitting the patch makes sense. >>>> >>>> Please let me know your opinions, thanks, >>>> >>>> - Joel >>> >>> Sure, I mean since we are fixing the buggy rcu_barrier_entrain() anyway, let's >>> just fix bypass as well. Such as in the following (untested): >> >> Got it. This sounds good to me, and will simplify the code a bit more for sure. >> >> I guess a question for Paul - are you Ok with rcu_barrier() causing wake ups >> if the bypass list has any non-lazy CBs as well? That should be OK, IMO. > > In theory, I am OK with it. In practice, you are the guys with the > hardware that can measure power consumption, not me! ;-) Ok I’ll do it this way and I’ll add Frederic’s Suggested-by tag. About power, I have already measured and it has no effect on power that I could find. Thanks! 
- Joel > >>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c >>> index b39e97175a9e..a0df964abb0e 100644 >>> --- a/kernel/rcu/tree.c >>> +++ b/kernel/rcu/tree.c >>> @@ -3834,6 +3834,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) >>> { >>> unsigned long gseq = READ_ONCE(rcu_state.barrier_sequence); >>> unsigned long lseq = READ_ONCE(rdp->barrier_seq_snap); >>> + bool wake_nocb = false; >>> + bool was_alldone = false; >>> >>> lockdep_assert_held(&rcu_state.barrier_lock); >>> if (rcu_seq_state(lseq) || !rcu_seq_state(gseq) || rcu_seq_ctr(lseq) != rcu_seq_ctr(gseq)) >>> @@ -3842,6 +3844,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) >>> rdp->barrier_head.func = rcu_barrier_callback; >>> debug_rcu_head_queue(&rdp->barrier_head); >>> rcu_nocb_lock(rdp); >>> + if (rcu_rdp_is_offloaded(rdp) && !rcu_segcblist_pend_cbs(&rdp->cblist)) >>> + was_alldone = true; >>> WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); >>> if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) { >>> atomic_inc(&rcu_state.barrier_cpu_count); >>> @@ -3849,7 +3853,12 @@ static void rcu_barrier_entrain(struct rcu_data *rdp) >>> debug_rcu_head_unqueue(&rdp->barrier_head); >>> rcu_barrier_trace(TPS("IRQNQ"), -1, rcu_state.barrier_sequence); >>> } >>> + if (was_alldone && rcu_segcblist_pend_cbs(&rdp->cblist)) >>> + wake_nocb = true; >>> rcu_nocb_unlock(rdp); >>> + if (wake_nocb) >>> + wake_nocb_gp(rdp, false); >>> + >> >> Thanks for the code snippet, I like how you are checking if the bypass list >> is empty, without actually checking it ;-) > > That certainly is consistent with the RCU philosophy. :-) > > Thanx, Paul
> On Sep 26, 2022, at 6:35 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > On Mon, Sep 26, 2022 at 08:54:27PM +0000, Joel Fernandes wrote: >> Hi Vlad, >> >> On Mon, Sep 26, 2022 at 09:39:23PM +0200, Uladzislau Rezki wrote: >> [...] >>>>> On my KVM machine the boot time is affected: >>>>> >>>>> <snip> >>>>> [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection >>>>> [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 >>>>> [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray >>>>> [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 >>>>> [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 >>>>> [ 104.115418] process '/usr/bin/fstype' started with executable stack >>>>> [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. >>>>> [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) >>>>> [ 104.340193] systemd[1]: Detected virtualization kvm. >>>>> [ 104.340196] systemd[1]: Detected architecture x86-64. >>>>> [ 104.359032] systemd[1]: Set hostname to <pc638>. >>>>> [ 105.740109] random: crng init done >>>>> [ 105.741267] systemd[1]: Reached target Remote File Systems. >>>>> <snip> >>>>> >>>>> 2 - 11 and second delay is between 32 - 104. So there are still users which must >>>>> be waiting for "RCU" in a sync way. >>>> >>>> I was wondering if you can compare boot logs and see which timestamp does the >>>> slow down start from. That way, we can narrow down the callback. Also another >>>> idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback >>>> ftrace_dump_on_oops" to the boot params, and then manually call >>>> "tracing_off(); panic();" from the code at the first printk that seems off in >>>> your comparison of good vs bad. For example, if "crng init done" timestamp is >>>> off, put the "tracing_off(); panic();" there. Then grab the serial console >>>> output to see what were the last callbacks that was queued/invoked. >> >> Would you be willing to try these steps? Meanwhile I will try on my side as >> well with the .config you sent me in another email. >> >>>>>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h >>>>>> index 08605ce7379d..40ae36904825 100644 >>>>>> --- a/include/linux/rcupdate.h >>>>>> +++ b/include/linux/rcupdate.h >>>>>> @@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void) >>>>>> >>>>>> #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ >>>>>> >>>>>> +#ifdef CONFIG_RCU_LAZY >>>>>> +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func); >>>>>> +#else >>>>>> +static inline void call_rcu_flush(struct rcu_head *head, >>>>>> + rcu_callback_t func) { call_rcu(head, func); } >>>>>> +#endif >>>>>> + >>>>>> /* Internal to kernel */ >>>>>> void rcu_init(void); >>>>>> extern int rcu_scheduler_active; >>>>>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig >>>>>> index f53ad63b2bc6..edd632e68497 100644 >>>>>> --- a/kernel/rcu/Kconfig >>>>>> +++ b/kernel/rcu/Kconfig >>>>>> @@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB >>>>>> Say N here if you hate read-side memory barriers. >>>>>> Take the default if you are unsure. 
>>>>>> >>>>>> +config RCU_LAZY >>>>>> + bool "RCU callback lazy invocation functionality" >>>>>> + depends on RCU_NOCB_CPU >>>>>> + default n >>>>>> + help >>>>>> + To save power, batch RCU callbacks and flush after delay, memory >>>>>> + pressure or callback list growing too big. >>>>>> + >>>>>> >>>>> Do you think you need this kernel option? Can we just consider and make >>>>> it a run-time configurable? For example much more users will give it a try, >>>>> so it will increase a coverage. By default it can be off. >>>>> >>>>> Also you do not need to do: >>>>> >>>>> #ifdef LAZY >>>> >>>> How does the "LAZY" macro end up being runtime-configurable? That's static / >>>> compile time. Did I miss something? >>>> >>> I am talking about removing if: >>> >>> config RCU_LAZY >>> >>> we might run into issues related to run-time switching though. >> >> When we started off, Paul said he wanted it kernel CONFIGurable. I will defer >> to Paul on a decision for that. I prefer kernel CONFIG so people don't forget >> to pass a boot param. > > I am fine with a kernel boot parameter for this one. You guys were the > ones preferring Kconfig options. ;-) Yes I still prefer that.. ;-) > But in that case, the CONFIG_RCU_NOCB_CPU would come into play to handle > the case where there is no bypass. If you don’t mind, let’s do both like we did for NOCB_CPU_ALL. In which case, Vlad since this was your suggestion, would you be so kind to send a patch adding a boot parameter on top of the series? ;-). I’ll include it in the next version. I’d suggest keep the boot param default off and add a CONFIG option that forces the boot param to be turned on. Thanks, - Joel > > Thanx, Paul
> On Sep 26, 2022, at 6:32 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > On Mon, Sep 26, 2022 at 09:02:21PM +0000, Joel Fernandes wrote: >> On Mon, Sep 26, 2022 at 09:32:44PM +0200, Uladzislau Rezki wrote: >> [...] >>>>>> On my KVM machine the boot time is affected: >>>>>> >>>>>> <snip> >>>>>> [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection >>>>>> [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 >>>>>> [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray >>>>>> [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 >>>>>> [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 >>>>>> [ 104.115418] process '/usr/bin/fstype' started with executable stack >>>>>> [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. >>>>>> [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) >>>>>> [ 104.340193] systemd[1]: Detected virtualization kvm. >>>>>> [ 104.340196] systemd[1]: Detected architecture x86-64. >>>>>> [ 104.359032] systemd[1]: Set hostname to <pc638>. >>>>>> [ 105.740109] random: crng init done >>>>>> [ 105.741267] systemd[1]: Reached target Remote File Systems. >>>>>> <snip> >>>>>> >>>>>> 2 - 11 and second delay is between 32 - 104. So there are still users which must >>>>>> be waiting for "RCU" in a sync way. >>>>> >>>>> I was wondering if you can compare boot logs and see which timestamp does the >>>>> slow down start from. That way, we can narrow down the callback. Also another >>>>> idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback >>>>> ftrace_dump_on_oops" to the boot params, and then manually call >>>>> "tracing_off(); panic();" from the code at the first printk that seems off in >>>>> your comparison of good vs bad. For example, if "crng init done" timestamp is >>>>> off, put the "tracing_off(); panic();" there. Then grab the serial console >>>>> output to see what were the last callbacks that was queued/invoked. >>>> >>>> We do seem to be in need of some way to quickly and easily locate the >>>> callback that needed to be _flush() due to a wakeup. >>>> >>> <snip> >>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c >>> index aeea9731ef80..fe1146d97f1a 100644 >>> --- a/kernel/workqueue.c >>> +++ b/kernel/workqueue.c >>> @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork) >>> >>> if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { >>> rwork->wq = wq; >>> - call_rcu(&rwork->rcu, rcu_work_rcufn); >>> + call_rcu_flush(&rwork->rcu, rcu_work_rcufn); >>> return true; >>> } >>> >>> <snip> >>> >>> ? >>> >>> But it does not fully solve my boot-up issue. Will debug tomorrow further. >> >> Ah, but at least its progress, thanks. Could you send me a patch to include >> in the next revision with details of this? >> >>>> Might one more proactive approach be to use Coccinelle to locate such >>>> callback functions? We might not want -all- callbacks that do wakeups >>>> to use call_rcu_flush(), but knowing which are which should speed up >>>> slow-boot debugging by quite a bit. >>>> >>>> Or is there a better way to do this? >>>> >>> I am not sure what Coccinelle is. If we had something automated that measures >>> a boot time and if needed does some profiling it would be good. Otherwise it >>> is a manual debugging mainly, IMHO. 
>> >> Paul, What about using a default-off kernel CONFIG that splats on all lazy >> call_rcu() callbacks that do a wake up. We could use the trace hooks to do it >> in kernel I think. I can talk to Steve to get ideas on how to do that but I >> think it can be done purely from trace events (we might need a new >> trace_end_invoke_callback to fire after the callback is invoked). Thoughts? > > Could you look for wakeups invoked between trace_rcu_batch_start() and > trace_rcu_batch_end() that are not from interrupt context? This would > of course need to be associated with a task rather than a CPU. Yes this sounds good, but we also need to know if the callbacks are lazy or not since wake-up is ok from a non lazy one. I think I’ll need a table to track that at queuing time. > Note that you would need to check for wakeups from interrupt handlers > even with the extra trace_end_invoke_callback(). The window where an > interrupt handler could do a wakeup would be reduced, but not eliminated. True! Since this is a debugging option, can we not just disable interrupts across callback invocation? Thanks, - Joel > > Thanx, Paul
On Mon, Sep 26, 2022 at 07:33:17PM -0400, Joel Fernandes wrote: > > > > On Sep 26, 2022, at 6:37 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Mon, Sep 26, 2022 at 09:07:12PM +0000, Joel Fernandes wrote: > >> Hi Paul, > >> > >> On Mon, Sep 26, 2022 at 10:42:40AM -0700, Paul E. McKenney wrote: > >> [..] > >>>>>>>> + WRITE_ONCE(rdp->lazy_len, 0); > >>>>>>>> + } else { > >>>>>>>> + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > >>>>>>>> + WRITE_ONCE(rdp->lazy_len, 0); > >>>>>>> > >>>>>>> This WRITE_ONCE() can be dropped out of the "if" statement, correct? > >>>>>> > >>>>>> Yes will update. > >>>>> > >>>>> Thank you! > >>>>> > >>>>>>> If so, this could be an "if" statement with two statements in its "then" > >>>>>>> clause, no "else" clause, and two statements following the "if" statement. > >>>>>> > >>>>>> I don’t think we can get rid of the else part but I’ll see what it looks like. > >>>>> > >>>>> In the function header, s/rhp/rhp_in/, then: > >>>>> > >>>>> struct rcu_head *rhp = rhp_in; > >>>>> > >>>>> And then: > >>>>> > >>>>> if (lazy && rhp) { > >>>>> rcu_cblist_enqueue(&rdp->nocb_bypass, rhp); > >>>>> rhp = NULL; > >>>> > >>>> This enqueues on to the bypass list, where as if lazy && rhp, I want to queue > >>>> the new rhp on to the main cblist. So the pseudo code in my patch is: > >>>> > >>>> if (lazy and rhp) then > >>>> 1. flush bypass CBs on to main list. > >>>> 2. queue new CB on to main list. > >>> > >>> And the difference is here, correct? I enqueue to the bypass list, > >>> which is then flushed (in order) to the main list. In contrast, you > >>> flush the bypass list, then enqueue to the main list. Either way, > >>> the callback referenced by rhp ends up at the end of ->cblist. > >>> > >>> Or am I on the wrong branch of this "if" statement? > >> > >> But we have to flush first, and then queue the new one. Otherwise wouldn't > >> the callbacks be invoked out of order? Or did I miss something? > > > > I don't think so... > > > > We want the new callback to be last, right? One way to do that is to > > flush the bypass, then queue the new callback onto ->cblist. Another way > > to do that is to enqueue the new callback onto the end of the bypass, > > then flush the bypass. Why wouldn't these result in the same order? > > Yes you are right, sorry. I was fixated on the main list. Both your snippet and my patch will be equivalent then. However I find your snippet a bit confusing, as in it is not immediately obvious - why would we queue something on to a list, if we were about to flush it. But any way, it does make it a clever piece of code in some sense and I am ok with doing it this way ;-) As long as the ->cblist.len comes out with the right value. ;-) Thanx, Paul > Thanks, > > - Joel > > > > > >>>> else > >>>> 1. flush bypass CBs on to main list > >>>> 2. queue new CB on to bypass list. > >>>> > >>>>> } > >>>>> rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > >>>>> WRITE_ONCE(rdp->lazy_len, 0); > >>>>> > >>>>> Or did I mess something up? > >>>> > >>>> So the rcu_cblist_flush_enqueue() has to happen before the > >>>> rcu_cblist_enqueue() to preserve the ordering of flushing into the main list, > >>>> and queuing on to the main list for the "if". Where as in your snip, the > >>>> order is reversed. > >>> > >>> Did I pick the correct branch of the "if" statement above? Or were you > >>> instead talking about the "else" clause? > >>> > >>> I would have been more worried about getting cblist->len right. 
> >> > >> Hmm, I think my concern was more the ordering of callbacks, and moving the > >> write to length should be Ok. > > > > OK, sounds good to me! ;-) > > > >>>> If I consolidate it then, it looks like the following. However, it is a bit > >>>> more unreadable. I could instead just take the WRITE_ONCE out of both if/else > >>>> and move it to after the if/else, that would be cleanest. Does that sound > >>>> good to you? Thanks! > >>> > >>> Let's first figure out whether or not we are talking past one another. ;-) > >> > >> Haha yeah :-) > > > > So were we? ;-) > > > > Thanx, Paul
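For readers following the thread, an untested sketch of the shape being agreed on here, based on Paul's earlier snippet (the function name, signature, and elided "..." bookkeeping are approximations, not the posted patch): the new lazy callback is appended to the bypass list first, and the subsequent in-order flush puts it last on ->cblist, exactly as flush-then-enqueue would.

<snip>
/*
 * Untested sketch; the surrounding bookkeeping of the real flush path
 * is elided ("...").
 */
static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp_in,
				     unsigned long j, bool lazy)
{
	struct rcu_cblist rcl;
	struct rcu_head *rhp = rhp_in;

	/* ... initialize rcl from rdp->cblist ... */

	/*
	 * If the new callback is lazy, append it to the bypass list first.
	 * The flush below moves callbacks in order, so it still ends up
	 * last on ->cblist -- the same order as flush-then-enqueue.
	 */
	if (lazy && rhp) {
		rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
		rhp = NULL;
	}
	rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
	WRITE_ONCE(rdp->lazy_len, 0);

	/* ... splice rcl back onto rdp->cblist, trace, update timestamps ... */
	return true;
}
<snip>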
On Mon, Sep 26, 2022 at 07:44:19PM -0400, Joel Fernandes wrote: > > > > On Sep 26, 2022, at 6:35 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Mon, Sep 26, 2022 at 08:54:27PM +0000, Joel Fernandes wrote: > >> Hi Vlad, > >> > >> On Mon, Sep 26, 2022 at 09:39:23PM +0200, Uladzislau Rezki wrote: > >> [...] > >>>>> On my KVM machine the boot time is affected: > >>>>> > >>>>> <snip> > >>>>> [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > >>>>> [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > >>>>> [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > >>>>> [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > >>>>> [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > >>>>> [ 104.115418] process '/usr/bin/fstype' started with executable stack > >>>>> [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > >>>>> [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > >>>>> [ 104.340193] systemd[1]: Detected virtualization kvm. > >>>>> [ 104.340196] systemd[1]: Detected architecture x86-64. > >>>>> [ 104.359032] systemd[1]: Set hostname to <pc638>. > >>>>> [ 105.740109] random: crng init done > >>>>> [ 105.741267] systemd[1]: Reached target Remote File Systems. > >>>>> <snip> > >>>>> > >>>>> 2 - 11 and second delay is between 32 - 104. So there are still users which must > >>>>> be waiting for "RCU" in a sync way. > >>>> > >>>> I was wondering if you can compare boot logs and see which timestamp does the > >>>> slow down start from. That way, we can narrow down the callback. Also another > >>>> idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > >>>> ftrace_dump_on_oops" to the boot params, and then manually call > >>>> "tracing_off(); panic();" from the code at the first printk that seems off in > >>>> your comparison of good vs bad. For example, if "crng init done" timestamp is > >>>> off, put the "tracing_off(); panic();" there. Then grab the serial console > >>>> output to see what were the last callbacks that was queued/invoked. > >> > >> Would you be willing to try these steps? Meanwhile I will try on my side as > >> well with the .config you sent me in another email. > >> > >>>>>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > >>>>>> index 08605ce7379d..40ae36904825 100644 > >>>>>> --- a/include/linux/rcupdate.h > >>>>>> +++ b/include/linux/rcupdate.h > >>>>>> @@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void) > >>>>>> > >>>>>> #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ > >>>>>> > >>>>>> +#ifdef CONFIG_RCU_LAZY > >>>>>> +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func); > >>>>>> +#else > >>>>>> +static inline void call_rcu_flush(struct rcu_head *head, > >>>>>> + rcu_callback_t func) { call_rcu(head, func); } > >>>>>> +#endif > >>>>>> + > >>>>>> /* Internal to kernel */ > >>>>>> void rcu_init(void); > >>>>>> extern int rcu_scheduler_active; > >>>>>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > >>>>>> index f53ad63b2bc6..edd632e68497 100644 > >>>>>> --- a/kernel/rcu/Kconfig > >>>>>> +++ b/kernel/rcu/Kconfig > >>>>>> @@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB > >>>>>> Say N here if you hate read-side memory barriers. > >>>>>> Take the default if you are unsure. 
> >>>>>> > >>>>>> +config RCU_LAZY > >>>>>> + bool "RCU callback lazy invocation functionality" > >>>>>> + depends on RCU_NOCB_CPU > >>>>>> + default n > >>>>>> + help > >>>>>> + To save power, batch RCU callbacks and flush after delay, memory > >>>>>> + pressure or callback list growing too big. > >>>>>> + > >>>>>> > >>>>> Do you think you need this kernel option? Can we just consider and make > >>>>> it a run-time configurable? For example much more users will give it a try, > >>>>> so it will increase a coverage. By default it can be off. > >>>>> > >>>>> Also you do not need to do: > >>>>> > >>>>> #ifdef LAZY > >>>> > >>>> How does the "LAZY" macro end up being runtime-configurable? That's static / > >>>> compile time. Did I miss something? > >>>> > >>> I am talking about removing if: > >>> > >>> config RCU_LAZY > >>> > >>> we might run into issues related to run-time switching though. > >> > >> When we started off, Paul said he wanted it kernel CONFIGurable. I will defer > >> to Paul on a decision for that. I prefer kernel CONFIG so people don't forget > >> to pass a boot param. > > > > I am fine with a kernel boot parameter for this one. You guys were the > > ones preferring Kconfig options. ;-) > > Yes I still prefer that.. ;-) > > > But in that case, the CONFIG_RCU_NOCB_CPU would come into play to handle > > the case where there is no bypass. > > If you don’t mind, let’s do both like we did for NOCB_CPU_ALL. In which case, Vlad since this was your suggestion, would you be so kind to send a patch adding a boot parameter on top of the series? ;-). I’ll include it in the next version. I’d suggest keep the boot param default off and add a CONFIG option that forces the boot param to be turned on. NOCB_CPU_ALL? If you are thinking in terms of laziness/flushing being done on a per-CPU basis among the rcu_nocbs CPUs, that sounds like something for later. Are you thinking in terms of Kconfig options that allow: (1) No laziness. (2) Laziness on all rcu_nocbs CPUs, but only if specified by a boot parameter. (3) Laziness on all rcu_nocbs CPUs regardless of boot parameter. I could get behind that. Thanx, Paul
On Mon, Sep 26, 2022 at 07:47:50PM -0400, Joel Fernandes wrote: > > > > On Sep 26, 2022, at 6:32 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Mon, Sep 26, 2022 at 09:02:21PM +0000, Joel Fernandes wrote: > >> On Mon, Sep 26, 2022 at 09:32:44PM +0200, Uladzislau Rezki wrote: > >> [...] > >>>>>> On my KVM machine the boot time is affected: > >>>>>> > >>>>>> <snip> > >>>>>> [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > >>>>>> [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > >>>>>> [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > >>>>>> [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > >>>>>> [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > >>>>>> [ 104.115418] process '/usr/bin/fstype' started with executable stack > >>>>>> [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > >>>>>> [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > >>>>>> [ 104.340193] systemd[1]: Detected virtualization kvm. > >>>>>> [ 104.340196] systemd[1]: Detected architecture x86-64. > >>>>>> [ 104.359032] systemd[1]: Set hostname to <pc638>. > >>>>>> [ 105.740109] random: crng init done > >>>>>> [ 105.741267] systemd[1]: Reached target Remote File Systems. > >>>>>> <snip> > >>>>>> > >>>>>> 2 - 11 and second delay is between 32 - 104. So there are still users which must > >>>>>> be waiting for "RCU" in a sync way. > >>>>> > >>>>> I was wondering if you can compare boot logs and see which timestamp does the > >>>>> slow down start from. That way, we can narrow down the callback. Also another > >>>>> idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > >>>>> ftrace_dump_on_oops" to the boot params, and then manually call > >>>>> "tracing_off(); panic();" from the code at the first printk that seems off in > >>>>> your comparison of good vs bad. For example, if "crng init done" timestamp is > >>>>> off, put the "tracing_off(); panic();" there. Then grab the serial console > >>>>> output to see what were the last callbacks that was queued/invoked. > >>>> > >>>> We do seem to be in need of some way to quickly and easily locate the > >>>> callback that needed to be _flush() due to a wakeup. > >>>> > >>> <snip> > >>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c > >>> index aeea9731ef80..fe1146d97f1a 100644 > >>> --- a/kernel/workqueue.c > >>> +++ b/kernel/workqueue.c > >>> @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork) > >>> > >>> if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { > >>> rwork->wq = wq; > >>> - call_rcu(&rwork->rcu, rcu_work_rcufn); > >>> + call_rcu_flush(&rwork->rcu, rcu_work_rcufn); > >>> return true; > >>> } > >>> > >>> <snip> > >>> > >>> ? > >>> > >>> But it does not fully solve my boot-up issue. Will debug tomorrow further. > >> > >> Ah, but at least its progress, thanks. Could you send me a patch to include > >> in the next revision with details of this? > >> > >>>> Might one more proactive approach be to use Coccinelle to locate such > >>>> callback functions? We might not want -all- callbacks that do wakeups > >>>> to use call_rcu_flush(), but knowing which are which should speed up > >>>> slow-boot debugging by quite a bit. > >>>> > >>>> Or is there a better way to do this? 
> >>>> > >>> I am not sure what Coccinelle is. If we had something automated that measures > >>> a boot time and if needed does some profiling it would be good. Otherwise it > >>> is a manual debugging mainly, IMHO. > >> > >> Paul, What about using a default-off kernel CONFIG that splats on all lazy > >> call_rcu() callbacks that do a wake up. We could use the trace hooks to do it > >> in kernel I think. I can talk to Steve to get ideas on how to do that but I > >> think it can be done purely from trace events (we might need a new > >> trace_end_invoke_callback to fire after the callback is invoked). Thoughts? > > > > Could you look for wakeups invoked between trace_rcu_batch_start() and > > trace_rcu_batch_end() that are not from interrupt context? This would > > of course need to be associated with a task rather than a CPU. > > Yes this sounds good, but we also need to know if the callbacks are lazy or not since wake-up is ok from a non lazy one. I think I’ll need a table to track that at queuing time. Agreed. > > Note that you would need to check for wakeups from interrupt handlers > > even with the extra trace_end_invoke_callback(). The window where an > > interrupt handler could do a wakeup would be reduced, but not eliminated. > > True! Since this is a debugging option, can we not just disable interrupts across callback invocation? Not without terminally annoying lockdep, at least for any RCU callbacks doing things like spin_lock_bh(). Thanx, Paul
On Mon, Sep 26, 2022 at 04:57:55PM -0700, Paul E. McKenney wrote: [..] > > >>>>>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > >>>>>> index 08605ce7379d..40ae36904825 100644 > > >>>>>> --- a/include/linux/rcupdate.h > > >>>>>> +++ b/include/linux/rcupdate.h > > >>>>>> @@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void) > > >>>>>> > > >>>>>> #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ > > >>>>>> > > >>>>>> +#ifdef CONFIG_RCU_LAZY > > >>>>>> +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func); > > >>>>>> +#else > > >>>>>> +static inline void call_rcu_flush(struct rcu_head *head, > > >>>>>> + rcu_callback_t func) { call_rcu(head, func); } > > >>>>>> +#endif > > >>>>>> + > > >>>>>> /* Internal to kernel */ > > >>>>>> void rcu_init(void); > > >>>>>> extern int rcu_scheduler_active; > > >>>>>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > > >>>>>> index f53ad63b2bc6..edd632e68497 100644 > > >>>>>> --- a/kernel/rcu/Kconfig > > >>>>>> +++ b/kernel/rcu/Kconfig > > >>>>>> @@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB > > >>>>>> Say N here if you hate read-side memory barriers. > > >>>>>> Take the default if you are unsure. > > >>>>>> > > >>>>>> +config RCU_LAZY > > >>>>>> + bool "RCU callback lazy invocation functionality" > > >>>>>> + depends on RCU_NOCB_CPU > > >>>>>> + default n > > >>>>>> + help > > >>>>>> + To save power, batch RCU callbacks and flush after delay, memory > > >>>>>> + pressure or callback list growing too big. > > >>>>>> + > > >>>>>> > > >>>>> Do you think you need this kernel option? Can we just consider and make > > >>>>> it a run-time configurable? For example much more users will give it a try, > > >>>>> so it will increase a coverage. By default it can be off. > > >>>>> > > >>>>> Also you do not need to do: > > >>>>> > > >>>>> #ifdef LAZY > > >>>> > > >>>> How does the "LAZY" macro end up being runtime-configurable? That's static / > > >>>> compile time. Did I miss something? > > >>>> > > >>> I am talking about removing if: > > >>> > > >>> config RCU_LAZY > > >>> > > >>> we might run into issues related to run-time switching though. > > >> > > >> When we started off, Paul said he wanted it kernel CONFIGurable. I will defer > > >> to Paul on a decision for that. I prefer kernel CONFIG so people don't forget > > >> to pass a boot param. > > > > > > I am fine with a kernel boot parameter for this one. You guys were the > > > ones preferring Kconfig options. ;-) > > > > Yes I still prefer that.. ;-) > > > > > But in that case, the CONFIG_RCU_NOCB_CPU would come into play to handle > > > the case where there is no bypass. > > > > If you don’t mind, let’s do both like we did for NOCB_CPU_ALL. In which > > case, Vlad since this was your suggestion, would you be so kind to send a > > patch adding a boot parameter on top of the series? ;-). I’ll include it > > in the next version. I’d suggest keep the boot param default off and add > > a CONFIG option that forces the boot param to be turned on. > > NOCB_CPU_ALL? If you are thinking in terms of laziness/flushing being > done on a per-CPU basis among the rcu_nocbs CPUs, that sounds like > something for later. Oh, no, I was just trying to bring that up as an example of making boot parameters and CONFIG options for the same thing. > Are you thinking in terms of Kconfig options that allow: (1) No laziness. > (2) Laziness on all rcu_nocbs CPUs, but only if specified by a boot > parameter. (3) Laziness on all rcu_nocbs CPUs regardless of boot > parameter. 
> I could get behind that. Sure agreed, or we could just make it CONFIG_RCU_LAZY_DEFAULT=y and if boot param is specified, override the CONFIG. That will be the simplest and least confusing IMO. thanks :) - Joel >
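A minimal sketch of what that combination might look like (hypothetical names, not part of the posted series): the Kconfig option only picks the default, and a boot parameter overrides it; since this would live in kernel/rcu/tree.c, the parameter would appear as "rcutree.rcu_lazy=0/1".

<snip>
#include <linux/moduleparam.h>

/* Hypothetical sketch: CONFIG_RCU_LAZY_DEFAULT sets the default,
 * "rcutree.rcu_lazy=0/1" overrides it at boot. */
static bool rcu_lazy = IS_ENABLED(CONFIG_RCU_LAZY_DEFAULT);
module_param(rcu_lazy, bool, 0444);

/* Consulted at call_rcu() time to decide whether a callback may be queued lazily. */
static bool rcu_cb_lazy_enabled(void)
{
	/* Laziness rides on the bypass machinery of offloaded (rcu_nocbs) CPUs. */
	return IS_ENABLED(CONFIG_RCU_NOCB_CPU) && rcu_lazy;
}
<snip>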
On Mon, Sep 26, 2022 at 04:59:44PM -0700, Paul E. McKenney wrote: > On Mon, Sep 26, 2022 at 07:47:50PM -0400, Joel Fernandes wrote: > > > > > > > On Sep 26, 2022, at 6:32 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > On Mon, Sep 26, 2022 at 09:02:21PM +0000, Joel Fernandes wrote: > > >> On Mon, Sep 26, 2022 at 09:32:44PM +0200, Uladzislau Rezki wrote: > > >> [...] > > >>>>>> On my KVM machine the boot time is affected: > > >>>>>> > > >>>>>> <snip> > > >>>>>> [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > >>>>>> [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > > >>>>>> [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > > >>>>>> [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > > >>>>>> [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > > >>>>>> [ 104.115418] process '/usr/bin/fstype' started with executable stack > > >>>>>> [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > > >>>>>> [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > > >>>>>> [ 104.340193] systemd[1]: Detected virtualization kvm. > > >>>>>> [ 104.340196] systemd[1]: Detected architecture x86-64. > > >>>>>> [ 104.359032] systemd[1]: Set hostname to <pc638>. > > >>>>>> [ 105.740109] random: crng init done > > >>>>>> [ 105.741267] systemd[1]: Reached target Remote File Systems. > > >>>>>> <snip> > > >>>>>> > > >>>>>> 2 - 11 and second delay is between 32 - 104. So there are still users which must > > >>>>>> be waiting for "RCU" in a sync way. > > >>>>> > > >>>>> I was wondering if you can compare boot logs and see which timestamp does the > > >>>>> slow down start from. That way, we can narrow down the callback. Also another > > >>>>> idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > > >>>>> ftrace_dump_on_oops" to the boot params, and then manually call > > >>>>> "tracing_off(); panic();" from the code at the first printk that seems off in > > >>>>> your comparison of good vs bad. For example, if "crng init done" timestamp is > > >>>>> off, put the "tracing_off(); panic();" there. Then grab the serial console > > >>>>> output to see what were the last callbacks that was queued/invoked. > > >>>> > > >>>> We do seem to be in need of some way to quickly and easily locate the > > >>>> callback that needed to be _flush() due to a wakeup. > > >>>> > > >>> <snip> > > >>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c > > >>> index aeea9731ef80..fe1146d97f1a 100644 > > >>> --- a/kernel/workqueue.c > > >>> +++ b/kernel/workqueue.c > > >>> @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork) > > >>> > > >>> if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { > > >>> rwork->wq = wq; > > >>> - call_rcu(&rwork->rcu, rcu_work_rcufn); > > >>> + call_rcu_flush(&rwork->rcu, rcu_work_rcufn); > > >>> return true; > > >>> } > > >>> > > >>> <snip> > > >>> > > >>> ? > > >>> > > >>> But it does not fully solve my boot-up issue. Will debug tomorrow further. > > >> > > >> Ah, but at least its progress, thanks. Could you send me a patch to include > > >> in the next revision with details of this? > > >> > > >>>> Might one more proactive approach be to use Coccinelle to locate such > > >>>> callback functions? 
We might not want -all- callbacks that do wakeups > > >>>> to use call_rcu_flush(), but knowing which are which should speed up > > >>>> slow-boot debugging by quite a bit. > > >>>> > > >>>> Or is there a better way to do this? > > >>>> > > >>> I am not sure what Coccinelle is. If we had something automated that measures > > >>> a boot time and if needed does some profiling it would be good. Otherwise it > > >>> is a manual debugging mainly, IMHO. > > >> > > >> Paul, What about using a default-off kernel CONFIG that splats on all lazy > > >> call_rcu() callbacks that do a wake up. We could use the trace hooks to do it > > >> in kernel I think. I can talk to Steve to get ideas on how to do that but I > > >> think it can be done purely from trace events (we might need a new > > >> trace_end_invoke_callback to fire after the callback is invoked). Thoughts? > > > > > > Could you look for wakeups invoked between trace_rcu_batch_start() and > > > trace_rcu_batch_end() that are not from interrupt context? This would > > > of course need to be associated with a task rather than a CPU. > > > > Yes this sounds good, but we also need to know if the callbacks are lazy or not since wake-up is ok from a non lazy one. I think I’ll need a table to track that at queuing time. > > Agreed. > > > > Note that you would need to check for wakeups from interrupt handlers > > > even with the extra trace_end_invoke_callback(). The window where an > > > interrupt handler could do a wakeup would be reduced, but not eliminated. > > > > True! Since this is a debugging option, can we not just disable interrupts across callback invocation? > > Not without terminally annoying lockdep, at least for any RCU callbacks > doing things like spin_lock_bh(). > Sorry if my last email bounced. Looks like my iPhone betrayed me this once ;) I was thinking something like this: 1. Put a flag in rcu_head to mark CBs as lazy. 2. Add a trace_rcu_invoke_callback_end() trace point. Both #1 and #2 can be a debug CONFIG option. #2 can be a tracepoint and not exposed if needed. 3. Put an in-kernel probe on both trace_rcu_invoke_callback_start() and trace_rcu_invoke_callback_end(). In the start probe, set a per-task flag if the current CB is lazy. In the end probe, clear it. 4. Put an in-kernel probe on trace_rcu_sched_wakeup(). Splat in the wake up probe if: 1. Hard IRQs are on. 2. The per-cpu flag is set. #3 actually does not even need probes if we can directly call the functions from the rcu_do_batch() function. I'll work on it in the morning and also look into Vlad's config. thanks, - Joel
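[As a rough illustration of the splat scheme sketched in the plan above, a minimal sketch follows. The per-task flag (step 3 calls it per-task, the splat conditions say per-cpu; per-task is assumed here), its name rcu_lazy_cb_active, the probe name, and the warning text are all hypothetical and not from the posted series; only the sched_wakeup probe registration is a real kernel interface.]

```c
#include <linux/init.h>
#include <linux/sched.h>
#include <trace/events/sched.h>

/*
 * Sketch only: splat when a lazily-queued RCU callback performs a wakeup.
 * "rcu_lazy_cb_active" is a hypothetical per-task debug flag that the
 * trace_rcu_invoke_callback_start()/end() probes would set and clear
 * around invocation of a lazy callback.
 */
static void rcu_lazy_wakeup_probe(void *ignore, struct task_struct *p)
{
	/*
	 * Only complain while hard IRQs are enabled, so a wakeup done by
	 * an interrupt handler that happened to fire during callback
	 * invocation is not blamed on the callback itself.
	 */
	if (!irqs_disabled() && current->rcu_lazy_cb_active)
		WARN_ONCE(1, "wakeup from lazy RCU callback; consider call_rcu_flush()\n");
}

static int __init rcu_lazy_debug_init(void)
{
	/* In-kernel probe on the sched_wakeup tracepoint; no tracefs needed. */
	return register_trace_sched_wakeup(rcu_lazy_wakeup_probe, NULL);
}
late_initcall(rcu_lazy_debug_init);
```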
On Tue, Sep 27, 2022 at 01:16:23AM +0000, Joel Fernandes wrote: > On Mon, Sep 26, 2022 at 04:57:55PM -0700, Paul E. McKenney wrote: > [..] > > > >>>>>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > > >>>>>> index 08605ce7379d..40ae36904825 100644 > > > >>>>>> --- a/include/linux/rcupdate.h > > > >>>>>> +++ b/include/linux/rcupdate.h > > > >>>>>> @@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void) > > > >>>>>> > > > >>>>>> #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ > > > >>>>>> > > > >>>>>> +#ifdef CONFIG_RCU_LAZY > > > >>>>>> +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func); > > > >>>>>> +#else > > > >>>>>> +static inline void call_rcu_flush(struct rcu_head *head, > > > >>>>>> + rcu_callback_t func) { call_rcu(head, func); } > > > >>>>>> +#endif > > > >>>>>> + > > > >>>>>> /* Internal to kernel */ > > > >>>>>> void rcu_init(void); > > > >>>>>> extern int rcu_scheduler_active; > > > >>>>>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig > > > >>>>>> index f53ad63b2bc6..edd632e68497 100644 > > > >>>>>> --- a/kernel/rcu/Kconfig > > > >>>>>> +++ b/kernel/rcu/Kconfig > > > >>>>>> @@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB > > > >>>>>> Say N here if you hate read-side memory barriers. > > > >>>>>> Take the default if you are unsure. > > > >>>>>> > > > >>>>>> +config RCU_LAZY > > > >>>>>> + bool "RCU callback lazy invocation functionality" > > > >>>>>> + depends on RCU_NOCB_CPU > > > >>>>>> + default n > > > >>>>>> + help > > > >>>>>> + To save power, batch RCU callbacks and flush after delay, memory > > > >>>>>> + pressure or callback list growing too big. > > > >>>>>> + > > > >>>>>> > > > >>>>> Do you think you need this kernel option? Can we just consider and make > > > >>>>> it a run-time configurable? For example much more users will give it a try, > > > >>>>> so it will increase a coverage. By default it can be off. > > > >>>>> > > > >>>>> Also you do not need to do: > > > >>>>> > > > >>>>> #ifdef LAZY > > > >>>> > > > >>>> How does the "LAZY" macro end up being runtime-configurable? That's static / > > > >>>> compile time. Did I miss something? > > > >>>> > > > >>> I am talking about removing if: > > > >>> > > > >>> config RCU_LAZY > > > >>> > > > >>> we might run into issues related to run-time switching though. > > > >> > > > >> When we started off, Paul said he wanted it kernel CONFIGurable. I will defer > > > >> to Paul on a decision for that. I prefer kernel CONFIG so people don't forget > > > >> to pass a boot param. > > > > > > > > I am fine with a kernel boot parameter for this one. You guys were the > > > > ones preferring Kconfig options. ;-) > > > > > > Yes I still prefer that.. ;-) > > > > > > > But in that case, the CONFIG_RCU_NOCB_CPU would come into play to handle > > > > the case where there is no bypass. > > > > > > If you don’t mind, let’s do both like we did for NOCB_CPU_ALL. In which > > > case, Vlad since this was your suggestion, would you be so kind to send a > > > patch adding a boot parameter on top of the series? ;-). I’ll include it > > > in the next version. I’d suggest keep the boot param default off and add > > > a CONFIG option that forces the boot param to be turned on. > > > > NOCB_CPU_ALL? If you are thinking in terms of laziness/flushing being > > done on a per-CPU basis among the rcu_nocbs CPUs, that sounds like > > something for later. > > Oh, no, I was just trying to bring that up as an example of making boot > parameters and CONFIG options for the same thing. 
> > > Are you thinking in terms of Kconfig options that allow: (1) No laziness. > > (2) Laziness on all rcu_nocbs CPUs, but only if specified by a boot > > parameter. (3) Laziness on all rcu_nocbs CPUs regardless of boot > > parameter. I could get behind that. > > Sure agreed, or we could just make it CONFIG_RCU_LAZY_DEFAULT=y and if boot > param is specified, override the CONFIG. That will be the simplest and least > confusing IMO. If CONFIG_RCU_LAZY_DEFAULT=n, what (if anything) does the boot parameter do? Not criticizing, not yet, anyway. ;-) Thanx, Paul
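[Purely as an illustration of the Kconfig-default-plus-boot-parameter combination being discussed: the CONFIG option would only pick the default and the boot parameter could flip it either way, so CONFIG_RCU_LAZY_DEFAULT=n plus the parameter would still enable laziness. The parameter name below is hypothetical and not from the posted series.]

```c
#include <linux/kconfig.h>
#include <linux/moduleparam.h>

/*
 * Hypothetical sketch: CONFIG_RCU_LAZY_DEFAULT only selects the default
 * value; a boot parameter such as "rcutree.rcu_lazy=0/1" overrides it.
 */
static bool rcu_lazy = IS_ENABLED(CONFIG_RCU_LAZY_DEFAULT);
module_param(rcu_lazy, bool, 0444);
```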
On Mon, Sep 26, 2022 at 08:03:43PM -0400, Joel Fernandes wrote: > On Mon, Sep 26, 2022 at 7:59 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > On Mon, Sep 26, 2022 at 07:47:50PM -0400, Joel Fernandes wrote: > > > > > > > > > > On Sep 26, 2022, at 6:32 PM, Paul E. McKenney <paulmck@kernel.org> > > wrote: > > > > > > > > On Mon, Sep 26, 2022 at 09:02:21PM +0000, Joel Fernandes wrote: > > > >> On Mon, Sep 26, 2022 at 09:32:44PM +0200, Uladzislau Rezki wrote: > > > >> [...] > > > >>>>>> On my KVM machine the boot time is affected: > > > >>>>>> > > > >>>>>> <snip> > > > >>>>>> [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network > > Connection > > > >>>>>> [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > > > >>>>>> [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw > > xa/form2 tray > > > >>>>>> [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > > > >>>>>> [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > > > >>>>>> [ 104.115418] process '/usr/bin/fstype' started with executable > > stack > > > >>>>>> [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered > > data mode. Quota mode: none. > > > >>>>>> [ 104.340125] systemd[1]: systemd 241 running in system mode. > > (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP > > +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN > > -PCRE2 default-hierarchy=hybrid) > > > >>>>>> [ 104.340193] systemd[1]: Detected virtualization kvm. > > > >>>>>> [ 104.340196] systemd[1]: Detected architecture x86-64. > > > >>>>>> [ 104.359032] systemd[1]: Set hostname to <pc638>. > > > >>>>>> [ 105.740109] random: crng init done > > > >>>>>> [ 105.741267] systemd[1]: Reached target Remote File Systems. > > > >>>>>> <snip> > > > >>>>>> > > > >>>>>> 2 - 11 and second delay is between 32 - 104. So there are still > > users which must > > > >>>>>> be waiting for "RCU" in a sync way. > > > >>>>> > > > >>>>> I was wondering if you can compare boot logs and see which > > timestamp does the > > > >>>>> slow down start from. That way, we can narrow down the callback. > > Also another > > > >>>>> idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > > > >>>>> ftrace_dump_on_oops" to the boot params, and then manually call > > > >>>>> "tracing_off(); panic();" from the code at the first printk that > > seems off in > > > >>>>> your comparison of good vs bad. For example, if "crng init done" > > timestamp is > > > >>>>> off, put the "tracing_off(); panic();" there. Then grab the serial > > console > > > >>>>> output to see what were the last callbacks that was queued/invoked. > > > >>>> > > > >>>> We do seem to be in need of some way to quickly and easily locate > > the > > > >>>> callback that needed to be _flush() due to a wakeup. > > > >>>> > > > >>> <snip> > > > >>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c > > > >>> index aeea9731ef80..fe1146d97f1a 100644 > > > >>> --- a/kernel/workqueue.c > > > >>> +++ b/kernel/workqueue.c > > > >>> @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct > > *wq, struct rcu_work *rwork) > > > >>> > > > >>> if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, > > work_data_bits(work))) { > > > >>> rwork->wq = wq; > > > >>> - call_rcu(&rwork->rcu, rcu_work_rcufn); > > > >>> + call_rcu_flush(&rwork->rcu, rcu_work_rcufn); > > > >>> return true; > > > >>> } > > > >>> > > > >>> <snip> > > > >>> > > > >>> ? > > > >>> > > > >>> But it does not fully solve my boot-up issue. Will debug tomorrow > > further. 
> > > >> > > > >> Ah, but at least its progress, thanks. Could you send me a patch to > > include > > > >> in the next revision with details of this? > > > >> > > > >>>> Might one more proactive approach be to use Coccinelle to locate > > such > > > >>>> callback functions? We might not want -all- callbacks that do > > wakeups > > > >>>> to use call_rcu_flush(), but knowing which are which should speed up > > > >>>> slow-boot debugging by quite a bit. > > > >>>> > > > >>>> Or is there a better way to do this? > > > >>>> > > > >>> I am not sure what Coccinelle is. If we had something automated that > > measures > > > >>> a boot time and if needed does some profiling it would be good. > > Otherwise it > > > >>> is a manual debugging mainly, IMHO. > > > >> > > > >> Paul, What about using a default-off kernel CONFIG that splats on all > > lazy > > > >> call_rcu() callbacks that do a wake up. We could use the trace hooks > > to do it > > > >> in kernel I think. I can talk to Steve to get ideas on how to do that > > but I > > > >> think it can be done purely from trace events (we might need a new > > > >> trace_end_invoke_callback to fire after the callback is invoked). > > Thoughts? > > > > > > > > Could you look for wakeups invoked between trace_rcu_batch_start() and > > > > trace_rcu_batch_end() that are not from interrupt context? This would > > > > of course need to be associated with a task rather than a CPU. > > > > > > Yes this sounds good, but we also need to know if the callbacks are lazy > > or not since wake-up is ok from a non lazy one. I think I’ll need a table > > to track that at queuing time. > > > > Agreed. > > > > > > Note that you would need to check for wakeups from interrupt handlers > > > > even with the extra trace_end_invoke_callback(). The window where an > > > > interrupt handler could do a wakeup would be reduced, but not > > eliminated. > > > > > > True! Since this is a debugging option, can we not just disable > > interrupts across callback invocation? > > > > Not without terminally annoying lockdep, at least for any RCU callbacks > > doing things like spin_lock_bh(). > > The easy fix for that is adding “depends on !LOCKDEP” to the Kconfig ;-) > just kidding. Hmm I think I can just look at the preempt flags and > determine if wake up happened in hard Irq context, and ignore those > instances. Or instrument/trace a few carefully chosen context-tracking functions. Thanx, Paul
On Tue, Sep 27, 2022 at 01:49:21AM +0000, Joel Fernandes wrote: > On Mon, Sep 26, 2022 at 04:59:44PM -0700, Paul E. McKenney wrote: > > On Mon, Sep 26, 2022 at 07:47:50PM -0400, Joel Fernandes wrote: > > > > > > > > > > On Sep 26, 2022, at 6:32 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > On Mon, Sep 26, 2022 at 09:02:21PM +0000, Joel Fernandes wrote: > > > >> On Mon, Sep 26, 2022 at 09:32:44PM +0200, Uladzislau Rezki wrote: > > > >> [...] > > > >>>>>> On my KVM machine the boot time is affected: > > > >>>>>> > > > >>>>>> <snip> > > > >>>>>> [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > > >>>>>> [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > > > >>>>>> [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > > > >>>>>> [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > > > >>>>>> [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > > > >>>>>> [ 104.115418] process '/usr/bin/fstype' started with executable stack > > > >>>>>> [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > > > >>>>>> [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > > > >>>>>> [ 104.340193] systemd[1]: Detected virtualization kvm. > > > >>>>>> [ 104.340196] systemd[1]: Detected architecture x86-64. > > > >>>>>> [ 104.359032] systemd[1]: Set hostname to <pc638>. > > > >>>>>> [ 105.740109] random: crng init done > > > >>>>>> [ 105.741267] systemd[1]: Reached target Remote File Systems. > > > >>>>>> <snip> > > > >>>>>> > > > >>>>>> 2 - 11 and second delay is between 32 - 104. So there are still users which must > > > >>>>>> be waiting for "RCU" in a sync way. > > > >>>>> > > > >>>>> I was wondering if you can compare boot logs and see which timestamp does the > > > >>>>> slow down start from. That way, we can narrow down the callback. Also another > > > >>>>> idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > > > >>>>> ftrace_dump_on_oops" to the boot params, and then manually call > > > >>>>> "tracing_off(); panic();" from the code at the first printk that seems off in > > > >>>>> your comparison of good vs bad. For example, if "crng init done" timestamp is > > > >>>>> off, put the "tracing_off(); panic();" there. Then grab the serial console > > > >>>>> output to see what were the last callbacks that was queued/invoked. > > > >>>> > > > >>>> We do seem to be in need of some way to quickly and easily locate the > > > >>>> callback that needed to be _flush() due to a wakeup. > > > >>>> > > > >>> <snip> > > > >>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c > > > >>> index aeea9731ef80..fe1146d97f1a 100644 > > > >>> --- a/kernel/workqueue.c > > > >>> +++ b/kernel/workqueue.c > > > >>> @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork) > > > >>> > > > >>> if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { > > > >>> rwork->wq = wq; > > > >>> - call_rcu(&rwork->rcu, rcu_work_rcufn); > > > >>> + call_rcu_flush(&rwork->rcu, rcu_work_rcufn); > > > >>> return true; > > > >>> } > > > >>> > > > >>> <snip> > > > >>> > > > >>> ? > > > >>> > > > >>> But it does not fully solve my boot-up issue. Will debug tomorrow further. > > > >> > > > >> Ah, but at least its progress, thanks. 
Could you send me a patch to include > > > >> in the next revision with details of this? > > > >> > > > >>>> Might one more proactive approach be to use Coccinelle to locate such > > > >>>> callback functions? We might not want -all- callbacks that do wakeups > > > >>>> to use call_rcu_flush(), but knowing which are which should speed up > > > >>>> slow-boot debugging by quite a bit. > > > >>>> > > > >>>> Or is there a better way to do this? > > > >>>> > > > >>> I am not sure what Coccinelle is. If we had something automated that measures > > > >>> a boot time and if needed does some profiling it would be good. Otherwise it > > > >>> is a manual debugging mainly, IMHO. > > > >> > > > >> Paul, What about using a default-off kernel CONFIG that splats on all lazy > > > >> call_rcu() callbacks that do a wake up. We could use the trace hooks to do it > > > >> in kernel I think. I can talk to Steve to get ideas on how to do that but I > > > >> think it can be done purely from trace events (we might need a new > > > >> trace_end_invoke_callback to fire after the callback is invoked). Thoughts? > > > > > > > > Could you look for wakeups invoked between trace_rcu_batch_start() and > > > > trace_rcu_batch_end() that are not from interrupt context? This would > > > > of course need to be associated with a task rather than a CPU. > > > > > > Yes this sounds good, but we also need to know if the callbacks are lazy or not since wake-up is ok from a non lazy one. I think I’ll need a table to track that at queuing time. > > > > Agreed. > > > > > > Note that you would need to check for wakeups from interrupt handlers > > > > even with the extra trace_end_invoke_callback(). The window where an > > > > interrupt handler could do a wakeup would be reduced, but not eliminated. > > > > > > True! Since this is a debugging option, can we not just disable interrupts across callback invocation? > > > > Not without terminally annoying lockdep, at least for any RCU callbacks > > doing things like spin_lock_bh(). > > > > Sorry if my last email bounced. Looks like my iPhone betrayed me this once ;) > > I was thinking something like this: > 1. Put a flag in rcu_head to mark CBs as lazy. > 2. Add a trace_rcu_invoke_callback_end() trace point. > > Both #1 and #2 can be a debug CONFIG option. #2 can be a tracepoint and not > exposed if needed. > > 3. Put an in-kernel probe on both trace_rcu_invoke_callback_start() and > trace_rcu_invoke_callback_end(). In the start probe, set a per-task flag if > the current CB is lazy. In the end probe, clear it. > > 4. Put an in-kernel probe on trace_rcu_sched_wakeup(). > > Splat in the wake up probe if: > 1. Hard IRQs are on. > 2. The per-cpu flag is set. > > #3 actually does not even need probes if we can directly call the functions > from the rcu_do_batch() function. This is fine for an experiment or a debugging session, but a solution based totally on instrumentation would be better for production use. > I'll work on it in the morning and also look into Vlad's config. Sounds good! Thanx, Paul
On Mon, Sep 26, 2022 at 08:22:46PM -0700, Paul E. McKenney wrote: [..] > > > > >>> --- a/kernel/workqueue.c > > > > >>> +++ b/kernel/workqueue.c > > > > >>> @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork) > > > > >>> > > > > >>> if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { > > > > >>> rwork->wq = wq; > > > > >>> - call_rcu(&rwork->rcu, rcu_work_rcufn); > > > > >>> + call_rcu_flush(&rwork->rcu, rcu_work_rcufn); > > > > >>> return true; > > > > >>> } > > > > >>> > > > > >>> <snip> > > > > >>> > > > > >>> ? > > > > >>> > > > > >>> But it does not fully solve my boot-up issue. Will debug tomorrow further. > > > > >> > > > > >> Ah, but at least its progress, thanks. Could you send me a patch to include > > > > >> in the next revision with details of this? > > > > >> > > > > >>>> Might one more proactive approach be to use Coccinelle to locate such > > > > >>>> callback functions? We might not want -all- callbacks that do wakeups > > > > >>>> to use call_rcu_flush(), but knowing which are which should speed up > > > > >>>> slow-boot debugging by quite a bit. > > > > >>>> > > > > >>>> Or is there a better way to do this? > > > > >>>> > > > > >>> I am not sure what Coccinelle is. If we had something automated that measures > > > > >>> a boot time and if needed does some profiling it would be good. Otherwise it > > > > >>> is a manual debugging mainly, IMHO. > > > > >> > > > > >> Paul, What about using a default-off kernel CONFIG that splats on all lazy > > > > >> call_rcu() callbacks that do a wake up. We could use the trace hooks to do it > > > > >> in kernel I think. I can talk to Steve to get ideas on how to do that but I > > > > >> think it can be done purely from trace events (we might need a new > > > > >> trace_end_invoke_callback to fire after the callback is invoked). Thoughts? > > > > > > > > > > Could you look for wakeups invoked between trace_rcu_batch_start() and > > > > > trace_rcu_batch_end() that are not from interrupt context? This would > > > > > of course need to be associated with a task rather than a CPU. > > > > > > > > Yes this sounds good, but we also need to know if the callbacks are > > > > lazy or not since wake-up is ok from a non lazy one. I think I’ll > > > > need a table to track that at queuing time. > > > > > > Agreed. > > > > > > > > Note that you would need to check for wakeups from interrupt handlers > > > > > even with the extra trace_end_invoke_callback(). The window where an > > > > > interrupt handler could do a wakeup would be reduced, but not eliminated. > > > > > > > > True! Since this is a debugging option, can we not just disable interrupts across callback invocation? > > > > > > Not without terminally annoying lockdep, at least for any RCU callbacks > > > doing things like spin_lock_bh(). > > > > > > > Sorry if my last email bounced. Looks like my iPhone betrayed me this once ;) > > > > I was thinking something like this: > > 1. Put a flag in rcu_head to mark CBs as lazy. > > 2. Add a trace_rcu_invoke_callback_end() trace point. > > > > Both #1 and #2 can be a debug CONFIG option. #2 can be a tracepoint and not > > exposed if needed. > > > > 3. Put an in-kernel probe on both trace_rcu_invoke_callback_start() and > > trace_rcu_invoke_callback_end(). In the start probe, set a per-task flag if > > the current CB is lazy. In the end probe, clear it. > > > > 4. Put an in-kernel probe on trace_rcu_sched_wakeup(). > > > > Splat in the wake up probe if: > > 1. Hard IRQs are on. > > 2. 
The per-cpu flag is set. > > > > #3 actually does not even need probes if we can directly call the functions > > from the rcu_do_batch() function. > > This is fine for an experiment or a debugging session, but a solution > based totally on instrumentation would be better for production use. Maybe we can borrow the least-significant bit of rhp->func to mark laziness? Then it can be production as long as we're ok with the trace_sched_wakeup probe. thanks, - Joel
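[A rough sketch of the least-significant-bit idea, assuming callback function addresses are at least 2-byte aligned, which is exactly the assumption questioned in the next message. The helper names are invented for the sketch.]

```c
#include <linux/types.h>

/* Sketch: stash the "lazy" marker in bit 0 of rhp->func. */
static inline void rcu_cb_mark_lazy(struct rcu_head *rhp)
{
	rhp->func = (rcu_callback_t)((unsigned long)rhp->func | 0x1UL);
}

static inline bool rcu_cb_is_lazy(struct rcu_head *rhp)
{
	return (unsigned long)rhp->func & 0x1UL;
}

/* Strip the marker before actually invoking the callback. */
static inline rcu_callback_t rcu_cb_func(struct rcu_head *rhp)
{
	return (rcu_callback_t)((unsigned long)rhp->func & ~0x1UL);
}
```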
On Mon, Sep 26, 2022 at 08:54:27PM +0000, Joel Fernandes wrote: > Hi Vlad, > > On Mon, Sep 26, 2022 at 09:39:23PM +0200, Uladzislau Rezki wrote: > [...] > > > > On my KVM machine the boot time is affected: > > > > > > > > <snip> > > > > [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > > > [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > > > > [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > > > > [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > > > > [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > > > > [ 104.115418] process '/usr/bin/fstype' started with executable stack > > > > [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > > > > [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > > > > [ 104.340193] systemd[1]: Detected virtualization kvm. > > > > [ 104.340196] systemd[1]: Detected architecture x86-64. > > > > [ 104.359032] systemd[1]: Set hostname to <pc638>. > > > > [ 105.740109] random: crng init done > > > > [ 105.741267] systemd[1]: Reached target Remote File Systems. > > > > <snip> > > > > > > > > 2 - 11 and second delay is between 32 - 104. So there are still users which must > > > > be waiting for "RCU" in a sync way. > > > > > > I was wondering if you can compare boot logs and see which timestamp does the > > > slow down start from. That way, we can narrow down the callback. Also another > > > idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > > > ftrace_dump_on_oops" to the boot params, and then manually call > > > "tracing_off(); panic();" from the code at the first printk that seems off in > > > your comparison of good vs bad. For example, if "crng init done" timestamp is > > > off, put the "tracing_off(); panic();" there. Then grab the serial console > > > output to see what were the last callbacks that was queued/invoked. > > Would you be willing to try these steps? Meanwhile I will try on my side as > well with the .config you sent me in another email. > Not exactly those steps. But see below: <snip> [ 2.291319] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection [ 17.302946] e1000 0000:00:03.0 ens3: renamed from eth0 <snip> 15 seconds delay between two prints. I have logged all call_rcu() users between those two prints: <snip> # tracer: nop # # entries-in-buffer/entries-written: 166/166 #P:64 # # _-----=> irqs-off/BH-disabled # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / _-=> migrate-disable # |||| / delay # TASK-PID CPU# ||||| TIMESTAMP FUNCTION # | | | ||||| | | systemd-udevd-669 [002] ..... 2.338739: e1000_probe: Intel(R) PRO/1000 Network Connection systemd-udevd-665 [061] ..... 2.338952: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 2.338962: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 systemd-udevd-665 [061] ..... 2.338965: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-645 [053] ..... 2.338968: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 systemd-udevd-665 [061] ..... 2.338987: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 systemd-udevd-645 [053] ..... 
2.338989: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-645 [053] ..... 2.338999: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 systemd-udevd-645 [053] ..... 2.339002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-645 [053] ..... 2.339024: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 kworker/0:3-546 [000] d..1. 6.329516: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70effe40a0 rcuop/0-17 [000] b.... 6.337320: __call_rcu_common: -> 0x0: exit_creds+0x63/0x70 <- 0x0 kworker/38:1-744 [038] d..1. 6.841479: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f97e40a0 <...>-739 [035] d..1. 6.841479: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f8be40a0 <...>-732 [021] d..1. 6.841486: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f53e40a0 kworker/36:1-740 [036] d..1. 6.841487: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f8fe40a0 rcuop/21-170 [023] b.... 6.849276: __call_rcu_common: -> 0x0: exit_creds+0x63/0x70 <- 0x0 rcuop/38-291 [052] b.... 6.849950: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/38-291 [052] b.... 6.849957: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 kworker/5:1-712 [005] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f13e40a0 kworker/19:1-727 [019] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f4be40a0 <...>-719 [007] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f1be40a0 kworker/13:1-721 [013] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f33e40a0 kworker/52:1-756 [052] d..1. 7.097395: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fcfe40a0 kworker/29:1-611 [029] d..1. 7.097395: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f73e40a0 <...>-754 [049] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fc3e40a0 kworker/12:1-726 [012] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f2fe40a0 kworker/53:1-710 [053] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fd3e40a0 <...>-762 [061] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70ff3e40a0 <...>-757 [054] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fd7e40a0 kworker/25:1-537 [025] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f63e40a0 <...>-714 [004] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f0fe40a0 <...>-749 [044] d..1. 7.097413: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fafe40a0 kworker/51:1-755 [051] d..1. 7.097413: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fcbe40a0 <...>-764 [063] d..1. 7.097415: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70ffbe40a0 <...>-753 [045] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fb3e40a0 kworker/43:1-748 [043] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fabe40a0 kworker/41:1-747 [041] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fa3e40a0 kworker/57:1-760 [057] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fe3e40a0 <...>-720 [008] d..1. 
7.097418: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f1fe40a0 kworker/58:1-759 [058] d..1. 7.097421: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fe7e40a0 kworker/16:1-728 [016] d..1. 7.097424: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f3fe40a0 <...>-722 [010] d..1. 7.097427: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f27e40a0 kworker/22:1-733 [022] d..1. 7.097432: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f57e40a0 <...>-731 [026] d..1. 7.097432: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f67e40a0 <...>-752 [048] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fbfe40a0 kworker/18:0-147 [018] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f47e40a0 kworker/39:1-745 [039] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f9be40a0 <...>-716 [003] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f0be40a0 <...>-703 [050] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fc7e40a0 kworker/42:1-746 [042] d..1. 7.097444: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fa7e40a0 rcuop/13-113 [013] b.... 7.105592: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/13-113 [013] b.... 7.105595: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/10-92 [040] b.... 7.105608: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/10-92 [040] b.... 7.105610: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/16-135 [023] b.... 7.105613: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/8-78 [039] b.... 7.105636: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/8-78 [039] b.... 7.105640: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/12-106 [040] b.... 7.105651: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/12-106 [040] b.... 7.105652: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/19-156 [000] b.... 7.105727: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/19-156 [000] b.... 7.105730: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/5-56 [058] b.... 7.105808: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/5-56 [058] b.... 7.105814: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/20-163 [023] b.... 17.345648: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/20-163 [023] b.... 17.345655: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/14-120 [013] b.... 17.345675: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/14-120 [013] b.... 17.345681: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/6-63 [013] b.... 17.345714: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/6-63 [013] b.... 17.345715: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/9-85 [000] b.... 17.345753: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/9-85 [000] b.... 17.345758: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/17-142 [000] b.... 17.345775: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/17-142 [000] b.... 17.345776: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/17-142 [000] b.... 17.345777: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/11-99 [000] b.... 
17.345810: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/11-99 [000] b.... 17.345811: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/15-127 [013] b.... 17.345832: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/15-127 [013] b.... 17.345834: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/1-28 [000] b.... 17.345834: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/1-28 [000] b.... 17.345835: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/15-127 [013] b.... 17.345835: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/15-127 [013] b.... 17.345837: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 systemd-udevd-633 [035] ..... 17.346591: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 systemd-udevd-633 [035] ..... 17.346609: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 systemd-udevd-633 [035] ..... 17.346659: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 systemd-udevd-633 [035] ..... 17.346666: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 systemd-udevd-669 [002] ..... 17.347573: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 kworker/2:2-769 [002] ..... 17.347659: __call_rcu_common: -> 0x0: __wait_rcu_gp+0xff/0x120 <- 0x0 systemd-udevd-675 [012] ..... 17.347981: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-675 [012] ..... 17.348002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-675 [012] ..... 17.348037: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348098: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348117: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 systemd-udevd-665 [061] ..... 17.348120: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348156: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348166: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348176: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-642 [050] ..... 17.348179: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348186: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348197: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348200: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348231: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348240: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348250: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348259: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348262: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348305: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 
17.348317: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348332: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348336: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348394: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348403: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348406: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-645 [053] ..... 17.348503: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-645 [053] ..... 17.348531: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 systemd-udevd-645 [053] ..... 17.348535: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348536: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-645 [053] ..... 17.348563: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 systemd-udevd-665 [061] ..... 17.348575: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348628: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348704: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348828: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348884: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348904: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348954: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348983: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.348993: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349014: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349024: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349026: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349119: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349182: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349243: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349430: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349462: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349472: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349483: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 
17.349486: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349583: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349632: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349666: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349699: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-645 [053] ..... 17.349727: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349733: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-645 [053] ..... 17.349739: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 systemd-udevd-645 [053] ..... 17.349742: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349765: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-645 [053] ..... 17.349766: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 systemd-udevd-642 [050] ..... 17.349780: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 systemd-udevd-665 [061] ..... 17.349800: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-642 [050] ..... 17.349815: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-642 [050] ..... 17.349829: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 systemd-udevd-642 [050] ..... 17.349832: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349834: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-675 [012] ..... 17.349835: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-675 [012] ..... 17.349853: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-642 [050] ..... 17.349861: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 systemd-udevd-675 [012] ..... 17.349873: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.349879: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.350007: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.350011: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.350080: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.350175: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 systemd-udevd-665 [061] ..... 17.350362: dev_change_name: --> renamed from eth0 <snip> First delay: <snip> systemd-udevd-645 [053] ..... 2.339024: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 kworker/0:3-546 [000] d..1. 6.329516: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70effe40a0 <snip> __dentry_kill() function and after 4 seconds there is another one queue_rcu_work(). I have checked the __dentry_kill() if it can do any sync talk with RCU but from the first glance i do not see anything critical. But more attention is required. Second delay: <snip> rcuop/5-56 [058] b.... 
7.105814: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 rcuop/20-163 [023] b.... 17.345648: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 <snip> 10 seconds. But please note that it could be that there were not any callbacks queued during 10 seconds. -- Uladzislau Rezki
On Tue, Sep 27, 2022 at 01:05:41PM +0000, Joel Fernandes wrote: > On Mon, Sep 26, 2022 at 08:22:46PM -0700, Paul E. McKenney wrote: > [..] > > > > > >>> --- a/kernel/workqueue.c > > > > > >>> +++ b/kernel/workqueue.c > > > > > >>> @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork) > > > > > >>> > > > > > >>> if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { > > > > > >>> rwork->wq = wq; > > > > > >>> - call_rcu(&rwork->rcu, rcu_work_rcufn); > > > > > >>> + call_rcu_flush(&rwork->rcu, rcu_work_rcufn); > > > > > >>> return true; > > > > > >>> } > > > > > >>> > > > > > >>> <snip> > > > > > >>> > > > > > >>> ? > > > > > >>> > > > > > >>> But it does not fully solve my boot-up issue. Will debug tomorrow further. > > > > > >> > > > > > >> Ah, but at least its progress, thanks. Could you send me a patch to include > > > > > >> in the next revision with details of this? > > > > > >> > > > > > >>>> Might one more proactive approach be to use Coccinelle to locate such > > > > > >>>> callback functions? We might not want -all- callbacks that do wakeups > > > > > >>>> to use call_rcu_flush(), but knowing which are which should speed up > > > > > >>>> slow-boot debugging by quite a bit. > > > > > >>>> > > > > > >>>> Or is there a better way to do this? > > > > > >>>> > > > > > >>> I am not sure what Coccinelle is. If we had something automated that measures > > > > > >>> a boot time and if needed does some profiling it would be good. Otherwise it > > > > > >>> is a manual debugging mainly, IMHO. > > > > > >> > > > > > >> Paul, What about using a default-off kernel CONFIG that splats on all lazy > > > > > >> call_rcu() callbacks that do a wake up. We could use the trace hooks to do it > > > > > >> in kernel I think. I can talk to Steve to get ideas on how to do that but I > > > > > >> think it can be done purely from trace events (we might need a new > > > > > >> trace_end_invoke_callback to fire after the callback is invoked). Thoughts? > > > > > > > > > > > > Could you look for wakeups invoked between trace_rcu_batch_start() and > > > > > > trace_rcu_batch_end() that are not from interrupt context? This would > > > > > > of course need to be associated with a task rather than a CPU. > > > > > > > > > > Yes this sounds good, but we also need to know if the callbacks are > > > > > lazy or not since wake-up is ok from a non lazy one. I think I’ll > > > > > need a table to track that at queuing time. > > > > > > > > Agreed. > > > > > > > > > > Note that you would need to check for wakeups from interrupt handlers > > > > > > even with the extra trace_end_invoke_callback(). The window where an > > > > > > interrupt handler could do a wakeup would be reduced, but not eliminated. > > > > > > > > > > True! Since this is a debugging option, can we not just disable interrupts across callback invocation? > > > > > > > > Not without terminally annoying lockdep, at least for any RCU callbacks > > > > doing things like spin_lock_bh(). > > > > > > > > > > Sorry if my last email bounced. Looks like my iPhone betrayed me this once ;) > > > > > > I was thinking something like this: > > > 1. Put a flag in rcu_head to mark CBs as lazy. > > > 2. Add a trace_rcu_invoke_callback_end() trace point. > > > > > > Both #1 and #2 can be a debug CONFIG option. #2 can be a tracepoint and not > > > exposed if needed. > > > > > > 3. Put an in-kernel probe on both trace_rcu_invoke_callback_start() and > > > trace_rcu_invoke_callback_end(). 
In the start probe, set a per-task flag if > > > the current CB is lazy. In the end probe, clear it. > > > > > > 4. Put an in-kernel probe on trace_rcu_sched_wakeup(). > > > > > > Splat in the wake up probe if: > > > 1. Hard IRQs are on. > > > 2. The per-cpu flag is set. > > > > > > #3 actually does not even need probes if we can directly call the functions > > > from the rcu_do_batch() function. > > > > This is fine for an experiment or a debugging session, but a solution > > based totally on instrumentation would be better for production use. > > Maybe we can borrow the least-significant bit of rhp->func to mark laziness? > Then it can be production as long as we're ok with the trace_sched_wakeup > probe. Last time I tried this, there were architectures that could have odd-valued function addresses. Maybe this is no longer the case? Thanx, Paul
On Tue, Sep 27, 2022 at 07:14:03AM -0700, Paul E. McKenney wrote: > On Tue, Sep 27, 2022 at 01:05:41PM +0000, Joel Fernandes wrote: > > On Mon, Sep 26, 2022 at 08:22:46PM -0700, Paul E. McKenney wrote: > > [..] > > > > > > >>> --- a/kernel/workqueue.c > > > > > > >>> +++ b/kernel/workqueue.c > > > > > > >>> @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork) > > > > > > >>> > > > > > > >>> if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { > > > > > > >>> rwork->wq = wq; > > > > > > >>> - call_rcu(&rwork->rcu, rcu_work_rcufn); > > > > > > >>> + call_rcu_flush(&rwork->rcu, rcu_work_rcufn); > > > > > > >>> return true; > > > > > > >>> } > > > > > > >>> > > > > > > >>> <snip> > > > > > > >>> > > > > > > >>> ? > > > > > > >>> > > > > > > >>> But it does not fully solve my boot-up issue. Will debug tomorrow further. > > > > > > >> > > > > > > >> Ah, but at least its progress, thanks. Could you send me a patch to include > > > > > > >> in the next revision with details of this? > > > > > > >> > > > > > > >>>> Might one more proactive approach be to use Coccinelle to locate such > > > > > > >>>> callback functions? We might not want -all- callbacks that do wakeups > > > > > > >>>> to use call_rcu_flush(), but knowing which are which should speed up > > > > > > >>>> slow-boot debugging by quite a bit. > > > > > > >>>> > > > > > > >>>> Or is there a better way to do this? > > > > > > >>>> > > > > > > >>> I am not sure what Coccinelle is. If we had something automated that measures > > > > > > >>> a boot time and if needed does some profiling it would be good. Otherwise it > > > > > > >>> is a manual debugging mainly, IMHO. > > > > > > >> > > > > > > >> Paul, What about using a default-off kernel CONFIG that splats on all lazy > > > > > > >> call_rcu() callbacks that do a wake up. We could use the trace hooks to do it > > > > > > >> in kernel I think. I can talk to Steve to get ideas on how to do that but I > > > > > > >> think it can be done purely from trace events (we might need a new > > > > > > >> trace_end_invoke_callback to fire after the callback is invoked). Thoughts? > > > > > > > > > > > > > > Could you look for wakeups invoked between trace_rcu_batch_start() and > > > > > > > trace_rcu_batch_end() that are not from interrupt context? This would > > > > > > > of course need to be associated with a task rather than a CPU. > > > > > > > > > > > > Yes this sounds good, but we also need to know if the callbacks are > > > > > > lazy or not since wake-up is ok from a non lazy one. I think I’ll > > > > > > need a table to track that at queuing time. > > > > > > > > > > Agreed. > > > > > > > > > > > > Note that you would need to check for wakeups from interrupt handlers > > > > > > > even with the extra trace_end_invoke_callback(). The window where an > > > > > > > interrupt handler could do a wakeup would be reduced, but not eliminated. > > > > > > > > > > > > True! Since this is a debugging option, can we not just disable interrupts across callback invocation? > > > > > > > > > > Not without terminally annoying lockdep, at least for any RCU callbacks > > > > > doing things like spin_lock_bh(). > > > > > > > > > > > > > Sorry if my last email bounced. Looks like my iPhone betrayed me this once ;) > > > > > > > > I was thinking something like this: > > > > 1. Put a flag in rcu_head to mark CBs as lazy. > > > > 2. Add a trace_rcu_invoke_callback_end() trace point. > > > > > > > > Both #1 and #2 can be a debug CONFIG option. 
#2 can be a tracepoint and not > > > > exposed if needed. > > > > > > > > 3. Put an in-kernel probe on both trace_rcu_invoke_callback_start() and > > > > trace_rcu_invoke_callback_end(). In the start probe, set a per-task flag if > > > > the current CB is lazy. In the end probe, clear it. > > > > > > > > 4. Put an in-kernel probe on trace_rcu_sched_wakeup(). > > > > > > > > Splat in the wake up probe if: > > > > 1. Hard IRQs are on. > > > > 2. The per-cpu flag is set. > > > > > > > > #3 actually does not even need probes if we can directly call the functions > > > > from the rcu_do_batch() function. > > > > > > This is fine for an experiment or a debugging session, but a solution > > > based totally on instrumentation would be better for production use. > > > > Maybe we can borrow the least-significant bit of rhp->func to mark laziness? > > Then it can be production as long as we're ok with the trace_sched_wakeup > > probe. > > Last time I tried this, there were architectures that could have odd-valued > function addresses. Maybe this is no longer the case? Oh ok! If this happens, maybe we can just make it depend on x86-64 assuming x86-64 does not have pointer oddness. We can also add a warning for if the function address is odd before setting the bit. thanks, - Joel
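[The alignment guard mentioned above might look like the sketch below: warn once and refuse to set the bit when a callback address is already odd, so an architecture with odd-valued function addresses would simply fall back to treating the callback as non-lazy. The helper name is hypothetical.]

```c
#include <linux/types.h>

/*
 * Sketch: before tagging, verify the callback address really is even.
 * On an architecture where function addresses can be odd, warn once
 * and let the caller queue the callback as non-lazy instead.
 */
static inline bool rcu_cb_try_mark_lazy(struct rcu_head *rhp)
{
	if (WARN_ON_ONCE((unsigned long)rhp->func & 0x1UL))
		return false;	/* Caller should treat this CB as non-lazy. */

	rhp->func = (rcu_callback_t)((unsigned long)rhp->func | 0x1UL);
	return true;
}
```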
On Tue, Sep 27, 2022 at 04:08:18PM +0200, Uladzislau Rezki wrote: > On Mon, Sep 26, 2022 at 08:54:27PM +0000, Joel Fernandes wrote: > > Hi Vlad, > > > > On Mon, Sep 26, 2022 at 09:39:23PM +0200, Uladzislau Rezki wrote: > > [...] > > > > > On my KVM machine the boot time is affected: > > > > > > > > > > <snip> > > > > > [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > > > > [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > > > > > [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > > > > > [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > > > > > [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > > > > > [ 104.115418] process '/usr/bin/fstype' started with executable stack > > > > > [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > > > > > [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > > > > > [ 104.340193] systemd[1]: Detected virtualization kvm. > > > > > [ 104.340196] systemd[1]: Detected architecture x86-64. > > > > > [ 104.359032] systemd[1]: Set hostname to <pc638>. > > > > > [ 105.740109] random: crng init done > > > > > [ 105.741267] systemd[1]: Reached target Remote File Systems. > > > > > <snip> > > > > > > > > > > 2 - 11 and second delay is between 32 - 104. So there are still users which must > > > > > be waiting for "RCU" in a sync way. > > > > > > > > I was wondering if you can compare boot logs and see which timestamp does the > > > > slow down start from. That way, we can narrow down the callback. Also another > > > > idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > > > > ftrace_dump_on_oops" to the boot params, and then manually call > > > > "tracing_off(); panic();" from the code at the first printk that seems off in > > > > your comparison of good vs bad. For example, if "crng init done" timestamp is > > > > off, put the "tracing_off(); panic();" there. Then grab the serial console > > > > output to see what were the last callbacks that was queued/invoked. > > > > Would you be willing to try these steps? Meanwhile I will try on my side as > > well with the .config you sent me in another email. > > > Not exactly those steps. But see below: > > <snip> > [ 2.291319] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > [ 17.302946] e1000 0000:00:03.0 ens3: renamed from eth0 > <snip> > > 15 seconds delay between two prints. I have logged all call_rcu() users > between those two prints: > > <snip> > # tracer: nop > # > # entries-in-buffer/entries-written: 166/166 #P:64 > # > # _-----=> irqs-off/BH-disabled > # / _----=> need-resched > # | / _---=> hardirq/softirq > # || / _--=> preempt-depth > # ||| / _-=> migrate-disable > # |||| / delay > # TASK-PID CPU# ||||| TIMESTAMP FUNCTION > # | | | ||||| | | > systemd-udevd-669 [002] ..... 2.338739: e1000_probe: Intel(R) PRO/1000 Network Connection > systemd-udevd-665 [061] ..... 2.338952: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 2.338962: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > systemd-udevd-665 [061] ..... 2.338965: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-645 [053] ..... 
2.338968: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > systemd-udevd-665 [061] ..... 2.338987: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > systemd-udevd-645 [053] ..... 2.338989: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-645 [053] ..... 2.338999: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > systemd-udevd-645 [053] ..... 2.339002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-645 [053] ..... 2.339024: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > kworker/0:3-546 [000] d..1. 6.329516: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70effe40a0 > rcuop/0-17 [000] b.... 6.337320: __call_rcu_common: -> 0x0: exit_creds+0x63/0x70 <- 0x0 > kworker/38:1-744 [038] d..1. 6.841479: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f97e40a0 > <...>-739 [035] d..1. 6.841479: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f8be40a0 > <...>-732 [021] d..1. 6.841486: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f53e40a0 > kworker/36:1-740 [036] d..1. 6.841487: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f8fe40a0 > rcuop/21-170 [023] b.... 6.849276: __call_rcu_common: -> 0x0: exit_creds+0x63/0x70 <- 0x0 > rcuop/38-291 [052] b.... 6.849950: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/38-291 [052] b.... 6.849957: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > kworker/5:1-712 [005] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f13e40a0 > kworker/19:1-727 [019] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f4be40a0 > <...>-719 [007] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f1be40a0 > kworker/13:1-721 [013] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f33e40a0 > kworker/52:1-756 [052] d..1. 7.097395: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fcfe40a0 > kworker/29:1-611 [029] d..1. 7.097395: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f73e40a0 > <...>-754 [049] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fc3e40a0 > kworker/12:1-726 [012] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f2fe40a0 > kworker/53:1-710 [053] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fd3e40a0 > <...>-762 [061] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70ff3e40a0 > <...>-757 [054] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fd7e40a0 > kworker/25:1-537 [025] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f63e40a0 > <...>-714 [004] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f0fe40a0 > <...>-749 [044] d..1. 7.097413: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fafe40a0 > kworker/51:1-755 [051] d..1. 7.097413: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fcbe40a0 > <...>-764 [063] d..1. 7.097415: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70ffbe40a0 > <...>-753 [045] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fb3e40a0 > kworker/43:1-748 [043] d..1. 
7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fabe40a0 > kworker/41:1-747 [041] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fa3e40a0 > kworker/57:1-760 [057] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fe3e40a0 > <...>-720 [008] d..1. 7.097418: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f1fe40a0 > kworker/58:1-759 [058] d..1. 7.097421: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fe7e40a0 > kworker/16:1-728 [016] d..1. 7.097424: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f3fe40a0 > <...>-722 [010] d..1. 7.097427: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f27e40a0 > kworker/22:1-733 [022] d..1. 7.097432: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f57e40a0 > <...>-731 [026] d..1. 7.097432: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f67e40a0 > <...>-752 [048] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fbfe40a0 > kworker/18:0-147 [018] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f47e40a0 > kworker/39:1-745 [039] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f9be40a0 > <...>-716 [003] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f0be40a0 > <...>-703 [050] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fc7e40a0 > kworker/42:1-746 [042] d..1. 7.097444: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fa7e40a0 > rcuop/13-113 [013] b.... 7.105592: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/13-113 [013] b.... 7.105595: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/10-92 [040] b.... 7.105608: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/10-92 [040] b.... 7.105610: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/16-135 [023] b.... 7.105613: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/8-78 [039] b.... 7.105636: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/8-78 [039] b.... 7.105640: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/12-106 [040] b.... 7.105651: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/12-106 [040] b.... 7.105652: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/19-156 [000] b.... 7.105727: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/19-156 [000] b.... 7.105730: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/5-56 [058] b.... 7.105808: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/5-56 [058] b.... 7.105814: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/20-163 [023] b.... 17.345648: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/20-163 [023] b.... 17.345655: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/14-120 [013] b.... 17.345675: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/14-120 [013] b.... 17.345681: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/6-63 [013] b.... 17.345714: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/6-63 [013] b.... 17.345715: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/9-85 [000] b.... 
17.345753: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/9-85 [000] b.... 17.345758: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/17-142 [000] b.... 17.345775: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/17-142 [000] b.... 17.345776: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/17-142 [000] b.... 17.345777: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/11-99 [000] b.... 17.345810: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/11-99 [000] b.... 17.345811: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/15-127 [013] b.... 17.345832: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/15-127 [013] b.... 17.345834: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/1-28 [000] b.... 17.345834: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/1-28 [000] b.... 17.345835: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/15-127 [013] b.... 17.345835: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/15-127 [013] b.... 17.345837: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > systemd-udevd-633 [035] ..... 17.346591: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > systemd-udevd-633 [035] ..... 17.346609: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > systemd-udevd-633 [035] ..... 17.346659: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > systemd-udevd-633 [035] ..... 17.346666: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > systemd-udevd-669 [002] ..... 17.347573: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > kworker/2:2-769 [002] ..... 17.347659: __call_rcu_common: -> 0x0: __wait_rcu_gp+0xff/0x120 <- 0x0 > systemd-udevd-675 [012] ..... 17.347981: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-675 [012] ..... 17.348002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-675 [012] ..... 17.348037: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348098: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348117: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > systemd-udevd-665 [061] ..... 17.348120: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348156: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348166: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348176: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-642 [050] ..... 17.348179: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348186: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348197: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348200: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348231: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 
17.348240: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348250: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348259: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348262: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348305: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348317: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348332: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348336: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348394: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348403: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348406: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-645 [053] ..... 17.348503: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-645 [053] ..... 17.348531: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > systemd-udevd-645 [053] ..... 17.348535: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348536: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-645 [053] ..... 17.348563: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > systemd-udevd-665 [061] ..... 17.348575: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348628: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348704: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348828: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348884: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348904: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348954: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348983: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.348993: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349014: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349024: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349026: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349119: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349182: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 
17.349243: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349430: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349462: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349472: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349483: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349486: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349583: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349632: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349666: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349699: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-645 [053] ..... 17.349727: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349733: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-645 [053] ..... 17.349739: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > systemd-udevd-645 [053] ..... 17.349742: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349765: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-645 [053] ..... 17.349766: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > systemd-udevd-642 [050] ..... 17.349780: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > systemd-udevd-665 [061] ..... 17.349800: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-642 [050] ..... 17.349815: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-642 [050] ..... 17.349829: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > systemd-udevd-642 [050] ..... 17.349832: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349834: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-675 [012] ..... 17.349835: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-675 [012] ..... 17.349853: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-642 [050] ..... 17.349861: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > systemd-udevd-675 [012] ..... 17.349873: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.349879: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.350007: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.350011: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.350080: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 17.350175: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > systemd-udevd-665 [061] ..... 
17.350362: dev_change_name: --> renamed from eth0 > <snip> > > First delay: > > <snip> > systemd-udevd-645 [053] ..... 2.339024: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > kworker/0:3-546 [000] d..1. 6.329516: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70effe40a0 > <snip> > > __dentry_kill() function, and after 4 seconds there is another one, queue_rcu_work(). > I have checked whether __dentry_kill() can do any sync talk with RCU, but at > first glance I do not see anything critical. More attention is required, though. Can you log rcu_barrier() as well? It could be that the print is just a side effect of something else that is not being printed. If you follow the steps I shared, we should be able to get a full log of RCU traces. > Second delay: > > <snip> > rcuop/5-56 [058] b.... 7.105814: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > rcuop/20-163 [023] b.... 17.345648: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > <snip> > > 10 seconds. But please note that it could be that no callbacks were queued > during those 10 seconds. True, there are things other than call_rcu(), such as rcu_barrier(), that can slow down the boot path. We fixed that and it is not an issue here, but I am not sure whether you are triggering it somehow. thanks, - Joel
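P.S. If it helps, the simplest way to also log rcu_barrier() might be to extend the trace_event= boot parameter from earlier in the thread, assuming the rcu:rcu_barrier trace event is available in your tree:

  trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback,rcu:rcu_barrier ftrace_dump_on_oops

That way any rcu_barrier() activity lands in the same ftrace buffer as the callback events, so the two can be correlated by timestamp.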
On Tue, Sep 27, 2022 at 02:22:56PM +0000, Joel Fernandes wrote: > On Tue, Sep 27, 2022 at 07:14:03AM -0700, Paul E. McKenney wrote: > > On Tue, Sep 27, 2022 at 01:05:41PM +0000, Joel Fernandes wrote: > > > On Mon, Sep 26, 2022 at 08:22:46PM -0700, Paul E. McKenney wrote: > > > [..] > > > > > > > >>> --- a/kernel/workqueue.c > > > > > > > >>> +++ b/kernel/workqueue.c > > > > > > > >>> @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork) > > > > > > > >>> > > > > > > > >>> if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { > > > > > > > >>> rwork->wq = wq; > > > > > > > >>> - call_rcu(&rwork->rcu, rcu_work_rcufn); > > > > > > > >>> + call_rcu_flush(&rwork->rcu, rcu_work_rcufn); > > > > > > > >>> return true; > > > > > > > >>> } > > > > > > > >>> > > > > > > > >>> <snip> > > > > > > > >>> > > > > > > > >>> ? > > > > > > > >>> > > > > > > > >>> But it does not fully solve my boot-up issue. Will debug tomorrow further. > > > > > > > >> > > > > > > > >> Ah, but at least its progress, thanks. Could you send me a patch to include > > > > > > > >> in the next revision with details of this? > > > > > > > >> > > > > > > > >>>> Might one more proactive approach be to use Coccinelle to locate such > > > > > > > >>>> callback functions? We might not want -all- callbacks that do wakeups > > > > > > > >>>> to use call_rcu_flush(), but knowing which are which should speed up > > > > > > > >>>> slow-boot debugging by quite a bit. > > > > > > > >>>> > > > > > > > >>>> Or is there a better way to do this? > > > > > > > >>>> > > > > > > > >>> I am not sure what Coccinelle is. If we had something automated that measures > > > > > > > >>> a boot time and if needed does some profiling it would be good. Otherwise it > > > > > > > >>> is a manual debugging mainly, IMHO. > > > > > > > >> > > > > > > > >> Paul, What about using a default-off kernel CONFIG that splats on all lazy > > > > > > > >> call_rcu() callbacks that do a wake up. We could use the trace hooks to do it > > > > > > > >> in kernel I think. I can talk to Steve to get ideas on how to do that but I > > > > > > > >> think it can be done purely from trace events (we might need a new > > > > > > > >> trace_end_invoke_callback to fire after the callback is invoked). Thoughts? > > > > > > > > > > > > > > > > Could you look for wakeups invoked between trace_rcu_batch_start() and > > > > > > > > trace_rcu_batch_end() that are not from interrupt context? This would > > > > > > > > of course need to be associated with a task rather than a CPU. > > > > > > > > > > > > > > Yes this sounds good, but we also need to know if the callbacks are > > > > > > > lazy or not since wake-up is ok from a non lazy one. I think I’ll > > > > > > > need a table to track that at queuing time. > > > > > > > > > > > > Agreed. > > > > > > > > > > > > > > Note that you would need to check for wakeups from interrupt handlers > > > > > > > > even with the extra trace_end_invoke_callback(). The window where an > > > > > > > > interrupt handler could do a wakeup would be reduced, but not eliminated. > > > > > > > > > > > > > > True! Since this is a debugging option, can we not just disable interrupts across callback invocation? > > > > > > > > > > > > Not without terminally annoying lockdep, at least for any RCU callbacks > > > > > > doing things like spin_lock_bh(). > > > > > > > > > > > > > > > > Sorry if my last email bounced. 
Looks like my iPhone betrayed me this once ;) > > > > > > > > > > I was thinking something like this: > > > > > 1. Put a flag in rcu_head to mark CBs as lazy. > > > > > 2. Add a trace_rcu_invoke_callback_end() trace point. > > > > > > > > > > Both #1 and #2 can be a debug CONFIG option. #2 can be a tracepoint and not > > > > > exposed if needed. > > > > > > > > > > 3. Put an in-kernel probe on both trace_rcu_invoke_callback_start() and > > > > > trace_rcu_invoke_callback_end(). In the start probe, set a per-task flag if > > > > > the current CB is lazy. In the end probe, clear it. > > > > > > > > > > 4. Put an in-kernel probe on trace_rcu_sched_wakeup(). > > > > > > > > > > Splat in the wake up probe if: > > > > > 1. Hard IRQs are on. > > > > > 2. The per-cpu flag is set. > > > > > > > > > > #3 actually does not even need probes if we can directly call the functions > > > > > from the rcu_do_batch() function. > > > > > > > > This is fine for an experiment or a debugging session, but a solution > > > > based totally on instrumentation would be better for production use. > > > > > > Maybe we can borrow the least-significant bit of rhp->func to mark laziness? > > > Then it can be production as long as we're ok with the trace_sched_wakeup > > > probe. > > > > Last time I tried this, there were architectures that could have odd-valued > > function addresses. Maybe this is no longer the case? > > Oh, OK! If that is a problem on some architectures, maybe we can just make it depend on x86-64, > assuming x86-64 does not have odd-valued function addresses. We can also add a warning if the > function address is odd before setting the bit. Let me rephrase this... ;-) Given that this used to not work and still might not work, let's see if we can find some other way to debug this, unless and until it can be demonstrated that no supported compiler will generate odd-valued function addresses on any supported architecture. Plus there was a time when x86 did have odd-valued pointer addresses. The instruction set is plenty fine with this, so it would have to be a compiler and assembly-language convention to avoid it. Thanx, Paul
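P.S. For illustration, the "flag in rcu_head" approach listed above might look roughly like the sketch below, behind a debug-only Kconfig option. This is completely untested, and CONFIG_RCU_LAZY_DEBUG, the debug_lazy field, and the per-task rcu_lazy_cb_active flag are all made-up names:

#ifdef CONFIG_RCU_LAZY_DEBUG
/* In rcu_do_batch(): bracket callback invocation with a per-task flag. */
static void rcu_debug_invoke_cb(struct rcu_head *rhp)
{
	current->rcu_lazy_cb_active = rhp->debug_lazy;
	rhp->func(rhp);
	current->rcu_lazy_cb_active = false;
}

/*
 * In the wakeup path (or a probe on trace_sched_wakeup): splat on a wakeup
 * issued directly from a lazy callback, but ignore wakeups done from hard
 * IRQ context since those are fine.
 */
static inline void rcu_debug_check_lazy_wakeup(void)
{
	WARN_ON_ONCE(!in_hardirq() && current->rcu_lazy_cb_active);
}
#endif /* CONFIG_RCU_LAZY_DEBUG */

This avoids any assumption about function-pointer alignment, at the cost of growing rcu_head in debug builds.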
On Tue, Sep 27, 2022 at 02:30:03PM +0000, Joel Fernandes wrote: > On Tue, Sep 27, 2022 at 04:08:18PM +0200, Uladzislau Rezki wrote: > > On Mon, Sep 26, 2022 at 08:54:27PM +0000, Joel Fernandes wrote: > > > Hi Vlad, > > > > > > On Mon, Sep 26, 2022 at 09:39:23PM +0200, Uladzislau Rezki wrote: > > > [...] > > > > > > On my KVM machine the boot time is affected: > > > > > > > > > > > > <snip> > > > > > > [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > > > > > [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > > > > > > [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > > > > > > [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > > > > > > [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > > > > > > [ 104.115418] process '/usr/bin/fstype' started with executable stack > > > > > > [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > > > > > > [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > > > > > > [ 104.340193] systemd[1]: Detected virtualization kvm. > > > > > > [ 104.340196] systemd[1]: Detected architecture x86-64. > > > > > > [ 104.359032] systemd[1]: Set hostname to <pc638>. > > > > > > [ 105.740109] random: crng init done > > > > > > [ 105.741267] systemd[1]: Reached target Remote File Systems. > > > > > > <snip> > > > > > > > > > > > > 2 - 11 and second delay is between 32 - 104. So there are still users which must > > > > > > be waiting for "RCU" in a sync way. > > > > > > > > > > I was wondering if you can compare boot logs and see which timestamp does the > > > > > slow down start from. That way, we can narrow down the callback. Also another > > > > > idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > > > > > ftrace_dump_on_oops" to the boot params, and then manually call > > > > > "tracing_off(); panic();" from the code at the first printk that seems off in > > > > > your comparison of good vs bad. For example, if "crng init done" timestamp is > > > > > off, put the "tracing_off(); panic();" there. Then grab the serial console > > > > > output to see what were the last callbacks that was queued/invoked. > > > > > > Would you be willing to try these steps? Meanwhile I will try on my side as > > > well with the .config you sent me in another email. > > > > > Not exactly those steps. But see below: > > > > <snip> > > [ 2.291319] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > [ 17.302946] e1000 0000:00:03.0 ens3: renamed from eth0 > > <snip> > > > > 15 seconds delay between two prints. I have logged all call_rcu() users > > between those two prints: > > > > <snip> > > # tracer: nop > > # > > # entries-in-buffer/entries-written: 166/166 #P:64 > > # > > # _-----=> irqs-off/BH-disabled > > # / _----=> need-resched > > # | / _---=> hardirq/softirq > > # || / _--=> preempt-depth > > # ||| / _-=> migrate-disable > > # |||| / delay > > # TASK-PID CPU# ||||| TIMESTAMP FUNCTION > > # | | | ||||| | | > > systemd-udevd-669 [002] ..... 2.338739: e1000_probe: Intel(R) PRO/1000 Network Connection > > systemd-udevd-665 [061] ..... 2.338952: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 2.338962: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > systemd-udevd-665 [061] ..... 
2.338965: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-645 [053] ..... 2.338968: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > systemd-udevd-665 [061] ..... 2.338987: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > systemd-udevd-645 [053] ..... 2.338989: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-645 [053] ..... 2.338999: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > systemd-udevd-645 [053] ..... 2.339002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-645 [053] ..... 2.339024: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > kworker/0:3-546 [000] d..1. 6.329516: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70effe40a0 > > rcuop/0-17 [000] b.... 6.337320: __call_rcu_common: -> 0x0: exit_creds+0x63/0x70 <- 0x0 > > kworker/38:1-744 [038] d..1. 6.841479: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f97e40a0 > > <...>-739 [035] d..1. 6.841479: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f8be40a0 > > <...>-732 [021] d..1. 6.841486: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f53e40a0 > > kworker/36:1-740 [036] d..1. 6.841487: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f8fe40a0 > > rcuop/21-170 [023] b.... 6.849276: __call_rcu_common: -> 0x0: exit_creds+0x63/0x70 <- 0x0 > > rcuop/38-291 [052] b.... 6.849950: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/38-291 [052] b.... 6.849957: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > kworker/5:1-712 [005] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f13e40a0 > > kworker/19:1-727 [019] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f4be40a0 > > <...>-719 [007] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f1be40a0 > > kworker/13:1-721 [013] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f33e40a0 > > kworker/52:1-756 [052] d..1. 7.097395: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fcfe40a0 > > kworker/29:1-611 [029] d..1. 7.097395: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f73e40a0 > > <...>-754 [049] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fc3e40a0 > > kworker/12:1-726 [012] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f2fe40a0 > > kworker/53:1-710 [053] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fd3e40a0 > > <...>-762 [061] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70ff3e40a0 > > <...>-757 [054] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fd7e40a0 > > kworker/25:1-537 [025] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f63e40a0 > > <...>-714 [004] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f0fe40a0 > > <...>-749 [044] d..1. 7.097413: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fafe40a0 > > kworker/51:1-755 [051] d..1. 7.097413: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fcbe40a0 > > <...>-764 [063] d..1. 7.097415: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70ffbe40a0 > > <...>-753 [045] d..1. 
7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fb3e40a0 > > kworker/43:1-748 [043] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fabe40a0 > > kworker/41:1-747 [041] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fa3e40a0 > > kworker/57:1-760 [057] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fe3e40a0 > > <...>-720 [008] d..1. 7.097418: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f1fe40a0 > > kworker/58:1-759 [058] d..1. 7.097421: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fe7e40a0 > > kworker/16:1-728 [016] d..1. 7.097424: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f3fe40a0 > > <...>-722 [010] d..1. 7.097427: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f27e40a0 > > kworker/22:1-733 [022] d..1. 7.097432: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f57e40a0 > > <...>-731 [026] d..1. 7.097432: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f67e40a0 > > <...>-752 [048] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fbfe40a0 > > kworker/18:0-147 [018] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f47e40a0 > > kworker/39:1-745 [039] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f9be40a0 > > <...>-716 [003] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f0be40a0 > > <...>-703 [050] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fc7e40a0 > > kworker/42:1-746 [042] d..1. 7.097444: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fa7e40a0 > > rcuop/13-113 [013] b.... 7.105592: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/13-113 [013] b.... 7.105595: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/10-92 [040] b.... 7.105608: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/10-92 [040] b.... 7.105610: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/16-135 [023] b.... 7.105613: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/8-78 [039] b.... 7.105636: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/8-78 [039] b.... 7.105640: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/12-106 [040] b.... 7.105651: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/12-106 [040] b.... 7.105652: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/19-156 [000] b.... 7.105727: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/19-156 [000] b.... 7.105730: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/5-56 [058] b.... 7.105808: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/5-56 [058] b.... 7.105814: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/20-163 [023] b.... 17.345648: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/20-163 [023] b.... 17.345655: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/14-120 [013] b.... 17.345675: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/14-120 [013] b.... 17.345681: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/6-63 [013] b.... 
17.345714: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/6-63 [013] b.... 17.345715: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/9-85 [000] b.... 17.345753: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/9-85 [000] b.... 17.345758: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/17-142 [000] b.... 17.345775: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/17-142 [000] b.... 17.345776: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/17-142 [000] b.... 17.345777: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/11-99 [000] b.... 17.345810: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/11-99 [000] b.... 17.345811: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/15-127 [013] b.... 17.345832: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/15-127 [013] b.... 17.345834: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/1-28 [000] b.... 17.345834: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/1-28 [000] b.... 17.345835: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/15-127 [013] b.... 17.345835: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > rcuop/15-127 [013] b.... 17.345837: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > systemd-udevd-633 [035] ..... 17.346591: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > systemd-udevd-633 [035] ..... 17.346609: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > systemd-udevd-633 [035] ..... 17.346659: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > systemd-udevd-633 [035] ..... 17.346666: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > systemd-udevd-669 [002] ..... 17.347573: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > kworker/2:2-769 [002] ..... 17.347659: __call_rcu_common: -> 0x0: __wait_rcu_gp+0xff/0x120 <- 0x0 > > systemd-udevd-675 [012] ..... 17.347981: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-675 [012] ..... 17.348002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-675 [012] ..... 17.348037: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348098: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348117: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > systemd-udevd-665 [061] ..... 17.348120: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348156: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348166: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348176: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-642 [050] ..... 17.348179: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348186: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348197: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 
17.348200: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348231: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348240: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348250: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348259: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348262: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348305: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348317: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348332: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348336: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348394: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348403: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348406: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-645 [053] ..... 17.348503: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-645 [053] ..... 17.348531: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > systemd-udevd-645 [053] ..... 17.348535: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348536: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-645 [053] ..... 17.348563: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > systemd-udevd-665 [061] ..... 17.348575: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348628: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348704: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348828: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348884: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348904: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348954: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348983: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.348993: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349014: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349024: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 
17.349026: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349119: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349182: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349243: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349430: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349462: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349472: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349483: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349486: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349583: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349632: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349666: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349699: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-645 [053] ..... 17.349727: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349733: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-645 [053] ..... 17.349739: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > systemd-udevd-645 [053] ..... 17.349742: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349765: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-645 [053] ..... 17.349766: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > systemd-udevd-642 [050] ..... 17.349780: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > systemd-udevd-665 [061] ..... 17.349800: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-642 [050] ..... 17.349815: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-642 [050] ..... 17.349829: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > systemd-udevd-642 [050] ..... 17.349832: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349834: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-675 [012] ..... 17.349835: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-675 [012] ..... 17.349853: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-642 [050] ..... 17.349861: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > systemd-udevd-675 [012] ..... 17.349873: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.349879: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 
17.350007: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.350011: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.350080: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.350175: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > systemd-udevd-665 [061] ..... 17.350362: dev_change_name: --> renamed from eth0 > > <snip> > > > > First delay: > > > > <snip> > > systemd-udevd-645 [053] ..... 2.339024: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > kworker/0:3-546 [000] d..1. 6.329516: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70effe40a0 > > <snip> > > > > __dentry_kill() function, and after 4 seconds there is another one, queue_rcu_work(). > > I have checked whether __dentry_kill() can do any sync talk with RCU, but at > > first glance I do not see anything critical. More attention is required, though. > > Can you log rcu_barrier() as well? It could be that the print is just a side > effect of something else that is not being printed. > It has nothing to do with rcu_barrier() in my case. Also, I have checked synchronize_rcu(); it also works as expected, i.e. it is not the blocking reason. Have you tried my config? -- Uladzislau Rezki
On Tue, Sep 27, 2022 at 04:59:44PM +0200, Uladzislau Rezki wrote: > On Tue, Sep 27, 2022 at 02:30:03PM +0000, Joel Fernandes wrote: > > On Tue, Sep 27, 2022 at 04:08:18PM +0200, Uladzislau Rezki wrote: > > > On Mon, Sep 26, 2022 at 08:54:27PM +0000, Joel Fernandes wrote: > > > > Hi Vlad, > > > > > > > > On Mon, Sep 26, 2022 at 09:39:23PM +0200, Uladzislau Rezki wrote: > > > > [...] > > > > > > > On my KVM machine the boot time is affected: > > > > > > > > > > > > > > <snip> > > > > > > > [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > > > > > > [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > > > > > > > [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > > > > > > > [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > > > > > > > [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > > > > > > > [ 104.115418] process '/usr/bin/fstype' started with executable stack > > > > > > > [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > > > > > > > [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > > > > > > > [ 104.340193] systemd[1]: Detected virtualization kvm. > > > > > > > [ 104.340196] systemd[1]: Detected architecture x86-64. > > > > > > > [ 104.359032] systemd[1]: Set hostname to <pc638>. > > > > > > > [ 105.740109] random: crng init done > > > > > > > [ 105.741267] systemd[1]: Reached target Remote File Systems. > > > > > > > <snip> > > > > > > > > > > > > > > 2 - 11 and second delay is between 32 - 104. So there are still users which must > > > > > > > be waiting for "RCU" in a sync way. > > > > > > > > > > > > I was wondering if you can compare boot logs and see which timestamp does the > > > > > > slow down start from. That way, we can narrow down the callback. Also another > > > > > > idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > > > > > > ftrace_dump_on_oops" to the boot params, and then manually call > > > > > > "tracing_off(); panic();" from the code at the first printk that seems off in > > > > > > your comparison of good vs bad. For example, if "crng init done" timestamp is > > > > > > off, put the "tracing_off(); panic();" there. Then grab the serial console > > > > > > output to see what were the last callbacks that was queued/invoked. > > > > > > > > Would you be willing to try these steps? Meanwhile I will try on my side as > > > > well with the .config you sent me in another email. > > > > > > > Not exactly those steps. But see below: > > > > > > <snip> > > > [ 2.291319] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > > [ 17.302946] e1000 0000:00:03.0 ens3: renamed from eth0 > > > <snip> > > > > > > 15 seconds delay between two prints. I have logged all call_rcu() users > > > between those two prints: > > > > > > <snip> > > > # tracer: nop > > > # > > > # entries-in-buffer/entries-written: 166/166 #P:64 > > > # > > > # _-----=> irqs-off/BH-disabled > > > # / _----=> need-resched > > > # | / _---=> hardirq/softirq > > > # || / _--=> preempt-depth > > > # ||| / _-=> migrate-disable > > > # |||| / delay > > > # TASK-PID CPU# ||||| TIMESTAMP FUNCTION > > > # | | | ||||| | | > > > systemd-udevd-669 [002] ..... 2.338739: e1000_probe: Intel(R) PRO/1000 Network Connection > > > systemd-udevd-665 [061] ..... 
2.338952: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 2.338962: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > systemd-udevd-665 [061] ..... 2.338965: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-645 [053] ..... 2.338968: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > systemd-udevd-665 [061] ..... 2.338987: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > systemd-udevd-645 [053] ..... 2.338989: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-645 [053] ..... 2.338999: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > systemd-udevd-645 [053] ..... 2.339002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-645 [053] ..... 2.339024: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > kworker/0:3-546 [000] d..1. 6.329516: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70effe40a0 > > > rcuop/0-17 [000] b.... 6.337320: __call_rcu_common: -> 0x0: exit_creds+0x63/0x70 <- 0x0 > > > kworker/38:1-744 [038] d..1. 6.841479: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f97e40a0 > > > <...>-739 [035] d..1. 6.841479: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f8be40a0 > > > <...>-732 [021] d..1. 6.841486: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f53e40a0 > > > kworker/36:1-740 [036] d..1. 6.841487: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f8fe40a0 > > > rcuop/21-170 [023] b.... 6.849276: __call_rcu_common: -> 0x0: exit_creds+0x63/0x70 <- 0x0 > > > rcuop/38-291 [052] b.... 6.849950: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/38-291 [052] b.... 6.849957: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > kworker/5:1-712 [005] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f13e40a0 > > > kworker/19:1-727 [019] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f4be40a0 > > > <...>-719 [007] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f1be40a0 > > > kworker/13:1-721 [013] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f33e40a0 > > > kworker/52:1-756 [052] d..1. 7.097395: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fcfe40a0 > > > kworker/29:1-611 [029] d..1. 7.097395: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f73e40a0 > > > <...>-754 [049] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fc3e40a0 > > > kworker/12:1-726 [012] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f2fe40a0 > > > kworker/53:1-710 [053] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fd3e40a0 > > > <...>-762 [061] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70ff3e40a0 > > > <...>-757 [054] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fd7e40a0 > > > kworker/25:1-537 [025] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f63e40a0 > > > <...>-714 [004] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f0fe40a0 > > > <...>-749 [044] d..1. 
7.097413: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fafe40a0 > > > kworker/51:1-755 [051] d..1. 7.097413: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fcbe40a0 > > > <...>-764 [063] d..1. 7.097415: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70ffbe40a0 > > > <...>-753 [045] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fb3e40a0 > > > kworker/43:1-748 [043] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fabe40a0 > > > kworker/41:1-747 [041] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fa3e40a0 > > > kworker/57:1-760 [057] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fe3e40a0 > > > <...>-720 [008] d..1. 7.097418: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f1fe40a0 > > > kworker/58:1-759 [058] d..1. 7.097421: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fe7e40a0 > > > kworker/16:1-728 [016] d..1. 7.097424: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f3fe40a0 > > > <...>-722 [010] d..1. 7.097427: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f27e40a0 > > > kworker/22:1-733 [022] d..1. 7.097432: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f57e40a0 > > > <...>-731 [026] d..1. 7.097432: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f67e40a0 > > > <...>-752 [048] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fbfe40a0 > > > kworker/18:0-147 [018] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f47e40a0 > > > kworker/39:1-745 [039] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f9be40a0 > > > <...>-716 [003] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f0be40a0 > > > <...>-703 [050] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fc7e40a0 > > > kworker/42:1-746 [042] d..1. 7.097444: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fa7e40a0 > > > rcuop/13-113 [013] b.... 7.105592: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/13-113 [013] b.... 7.105595: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/10-92 [040] b.... 7.105608: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/10-92 [040] b.... 7.105610: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/16-135 [023] b.... 7.105613: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/8-78 [039] b.... 7.105636: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/8-78 [039] b.... 7.105640: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/12-106 [040] b.... 7.105651: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/12-106 [040] b.... 7.105652: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/19-156 [000] b.... 7.105727: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/19-156 [000] b.... 7.105730: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/5-56 [058] b.... 7.105808: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/5-56 [058] b.... 7.105814: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/20-163 [023] b.... 
17.345648: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/20-163 [023] b.... 17.345655: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/14-120 [013] b.... 17.345675: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/14-120 [013] b.... 17.345681: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/6-63 [013] b.... 17.345714: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/6-63 [013] b.... 17.345715: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/9-85 [000] b.... 17.345753: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/9-85 [000] b.... 17.345758: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/17-142 [000] b.... 17.345775: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/17-142 [000] b.... 17.345776: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/17-142 [000] b.... 17.345777: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/11-99 [000] b.... 17.345810: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/11-99 [000] b.... 17.345811: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/15-127 [013] b.... 17.345832: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/15-127 [013] b.... 17.345834: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/1-28 [000] b.... 17.345834: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/1-28 [000] b.... 17.345835: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/15-127 [013] b.... 17.345835: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > rcuop/15-127 [013] b.... 17.345837: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > systemd-udevd-633 [035] ..... 17.346591: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > systemd-udevd-633 [035] ..... 17.346609: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > systemd-udevd-633 [035] ..... 17.346659: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > systemd-udevd-633 [035] ..... 17.346666: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > systemd-udevd-669 [002] ..... 17.347573: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > kworker/2:2-769 [002] ..... 17.347659: __call_rcu_common: -> 0x0: __wait_rcu_gp+0xff/0x120 <- 0x0 > > > systemd-udevd-675 [012] ..... 17.347981: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-675 [012] ..... 17.348002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-675 [012] ..... 17.348037: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348098: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348117: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > systemd-udevd-665 [061] ..... 17.348120: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348156: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348166: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 
17.348176: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-642 [050] ..... 17.348179: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348186: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348197: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348200: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348231: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348240: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348250: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348259: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348262: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348305: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348317: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348332: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348336: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348394: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348403: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348406: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-645 [053] ..... 17.348503: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-645 [053] ..... 17.348531: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > systemd-udevd-645 [053] ..... 17.348535: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348536: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-645 [053] ..... 17.348563: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > systemd-udevd-665 [061] ..... 17.348575: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348628: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348704: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348828: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348884: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348904: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348954: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.348983: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 
17.348993: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349014: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349024: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349026: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349119: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349182: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349243: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349430: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349462: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349472: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349483: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349486: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349583: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349632: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349666: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349699: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-645 [053] ..... 17.349727: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349733: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-645 [053] ..... 17.349739: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > systemd-udevd-645 [053] ..... 17.349742: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349765: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-645 [053] ..... 17.349766: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > systemd-udevd-642 [050] ..... 17.349780: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > systemd-udevd-665 [061] ..... 17.349800: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-642 [050] ..... 17.349815: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-642 [050] ..... 17.349829: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > systemd-udevd-642 [050] ..... 17.349832: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349834: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-675 [012] ..... 17.349835: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-675 [012] ..... 
17.349853: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-642 [050] ..... 17.349861: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > systemd-udevd-675 [012] ..... 17.349873: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349879: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.350007: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.350011: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.350080: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.350175: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.350362: dev_change_name: --> renamed from eth0 > > > <snip> > > > > > > First delay: > > > > > > <snip> > > > systemd-udevd-645 [053] ..... 2.339024: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > kworker/0:3-546 [000] d..1. 6.329516: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70effe40a0 > > > <snip> > > > > > > __dentry_kill() function and after 4 seconds there is another one queue_rcu_work(). > > > I have checked the __dentry_kill() if it can do any sync talk with RCU but from the > > > first glance i do not see anything critical. But more attention is required. > > > > Can you log rcu_barrier() as well? It could be that the print is just a side > > effect of something else that is not being printed. > > > It has nothing to do with rcu_barrier() in my case. Also i have checked > the synchronize_rcu() it also works as expected, i.e. it is not a > blocking reason. > > Have you tried my config? > > -- > Uladzislau Rezki OK. Seems one place i have spot: <snip> [ 7.074847] calling init_sr+0x0/0x1000 [sr_mod] @ 668 [ 22.422808] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray [ 22.422815] cdrom: Uniform CD-ROM driver Revision: 3.20 [ 32.664590] sr 1:0:0:0: Attached scsi CD-ROM sr0 [ 32.664642] initcall init_sr+0x0/0x1000 [sr_mod] returned 0 after 25589786 usecs <snip> -- Uladzislau Rezki
On Tue, Sep 27, 2022 at 04:59:44PM +0200, Uladzislau Rezki wrote: [...] > > > systemd-udevd-642 [050] ..... 17.349832: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349834: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-675 [012] ..... 17.349835: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-675 [012] ..... 17.349853: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-642 [050] ..... 17.349861: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > systemd-udevd-675 [012] ..... 17.349873: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.349879: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.350007: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.350011: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.350080: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.350175: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > systemd-udevd-665 [061] ..... 17.350362: dev_change_name: --> renamed from eth0 > > > <snip> > > > > > > First delay: > > > > > > <snip> > > > systemd-udevd-645 [053] ..... 2.339024: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > kworker/0:3-546 [000] d..1. 6.329516: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70effe40a0 > > > <snip> > > > > > > __dentry_kill() function and after 4 seconds there is another one queue_rcu_work(). > > > I have checked the __dentry_kill() if it can do any sync talk with RCU but from the > > > first glance i do not see anything critical. But more attention is required. > > > > Can you log rcu_barrier() as well? It could be that the print is just a side > > effect of something else that is not being printed. > > > It has nothing to do with rcu_barrier() in my case. Also i have checked > the synchronize_rcu() it also works as expected, i.e. it is not a > blocking reason. > > Have you tried my config? Yes I am in the process of trying it. thanks, - Joel
On Tue, Sep 27, 2022 at 07:30:20AM -0700, Paul E. McKenney wrote: > On Tue, Sep 27, 2022 at 02:22:56PM +0000, Joel Fernandes wrote: > > On Tue, Sep 27, 2022 at 07:14:03AM -0700, Paul E. McKenney wrote: > > > On Tue, Sep 27, 2022 at 01:05:41PM +0000, Joel Fernandes wrote: > > > > On Mon, Sep 26, 2022 at 08:22:46PM -0700, Paul E. McKenney wrote: > > > > [..] > > > > > > > > >>> --- a/kernel/workqueue.c > > > > > > > > >>> +++ b/kernel/workqueue.c > > > > > > > > >>> @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork) > > > > > > > > >>> > > > > > > > > >>> if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { > > > > > > > > >>> rwork->wq = wq; > > > > > > > > >>> - call_rcu(&rwork->rcu, rcu_work_rcufn); > > > > > > > > >>> + call_rcu_flush(&rwork->rcu, rcu_work_rcufn); > > > > > > > > >>> return true; > > > > > > > > >>> } > > > > > > > > >>> > > > > > > > > >>> <snip> > > > > > > > > >>> > > > > > > > > >>> ? > > > > > > > > >>> > > > > > > > > >>> But it does not fully solve my boot-up issue. Will debug tomorrow further. > > > > > > > > >> > > > > > > > > >> Ah, but at least its progress, thanks. Could you send me a patch to include > > > > > > > > >> in the next revision with details of this? > > > > > > > > >> > > > > > > > > >>>> Might one more proactive approach be to use Coccinelle to locate such > > > > > > > > >>>> callback functions? We might not want -all- callbacks that do wakeups > > > > > > > > >>>> to use call_rcu_flush(), but knowing which are which should speed up > > > > > > > > >>>> slow-boot debugging by quite a bit. > > > > > > > > >>>> > > > > > > > > >>>> Or is there a better way to do this? > > > > > > > > >>>> > > > > > > > > >>> I am not sure what Coccinelle is. If we had something automated that measures > > > > > > > > >>> a boot time and if needed does some profiling it would be good. Otherwise it > > > > > > > > >>> is a manual debugging mainly, IMHO. > > > > > > > > >> > > > > > > > > >> Paul, What about using a default-off kernel CONFIG that splats on all lazy > > > > > > > > >> call_rcu() callbacks that do a wake up. We could use the trace hooks to do it > > > > > > > > >> in kernel I think. I can talk to Steve to get ideas on how to do that but I > > > > > > > > >> think it can be done purely from trace events (we might need a new > > > > > > > > >> trace_end_invoke_callback to fire after the callback is invoked). Thoughts? > > > > > > > > > > > > > > > > > > Could you look for wakeups invoked between trace_rcu_batch_start() and > > > > > > > > > trace_rcu_batch_end() that are not from interrupt context? This would > > > > > > > > > of course need to be associated with a task rather than a CPU. > > > > > > > > > > > > > > > > Yes this sounds good, but we also need to know if the callbacks are > > > > > > > > lazy or not since wake-up is ok from a non lazy one. I think I’ll > > > > > > > > need a table to track that at queuing time. > > > > > > > > > > > > > > Agreed. > > > > > > > > > > > > > > > > Note that you would need to check for wakeups from interrupt handlers > > > > > > > > > even with the extra trace_end_invoke_callback(). The window where an > > > > > > > > > interrupt handler could do a wakeup would be reduced, but not eliminated. > > > > > > > > > > > > > > > > True! Since this is a debugging option, can we not just disable interrupts across callback invocation? 
> > > > > > > > > > > > > > Not without terminally annoying lockdep, at least for any RCU callbacks > > > > > > > doing things like spin_lock_bh(). > > > > > > > > > > > > > > > > > > > Sorry if my last email bounced. Looks like my iPhone betrayed me this once ;) > > > > > > > > > > > > I was thinking something like this: > > > > > > 1. Put a flag in rcu_head to mark CBs as lazy. > > > > > > 2. Add a trace_rcu_invoke_callback_end() trace point. > > > > > > > > > > > > Both #1 and #2 can be a debug CONFIG option. #2 can be a tracepoint and not > > > > > > exposed if needed. > > > > > > > > > > > > 3. Put an in-kernel probe on both trace_rcu_invoke_callback_start() and > > > > > > trace_rcu_invoke_callback_end(). In the start probe, set a per-task flag if > > > > > > the current CB is lazy. In the end probe, clear it. > > > > > > > > > > > > 4. Put an in-kernel probe on trace_rcu_sched_wakeup(). > > > > > > > > > > > > Splat in the wake up probe if: > > > > > > 1. Hard IRQs are on. > > > > > > 2. The per-cpu flag is set. > > > > > > > > > > > > #3 actually does not even need probes if we can directly call the functions > > > > > > from the rcu_do_batch() function. > > > > > > > > > > This is fine for an experiment or a debugging session, but a solution > > > > > based totally on instrumentation would be better for production use. > > > > > > > > Maybe we can borrow the least-significant bit of rhp->func to mark laziness? > > > > Then it can be production as long as we're ok with the trace_sched_wakeup > > > > probe. > > > > > > Last time I tried this, there were architectures that could have odd-valued > > > function addresses. Maybe this is no longer the case? > > > > Oh ok! If this happens, maybe we can just make it depend on x86-64 assuming > > x86-64 does not have pointer oddness. We can also add a warning for if the > > function address is odd before setting the bit. > > Let me rephrase this... ;-) > > Given that this used to not work and still might not work, let's see > if we can find some other way to debug this. Unless and until it can > be demonstrated that there is no supported compiler that will generated > odd-valued function addresses on any supported architecture. > > Plus there was a time that x86 did odd-valued pointer addresses. > The instruction set is plenty fine with this, so it would have to be a > compiler and assembly-language convention to avoid it. Ok, so then I am not sure how to make it work in production at the moment. I could track the lazy callbacks in a hashtable but then that's overhead. Or, I could focus on trying Vlad's config and figure out what's going on and keep the auto-debug for later. On another thought, this is the sort of thing that should be doable via Daniel Bristot's runtime verification framework, as its a classical "see if these traces look right" issue which should be teachable to a computer with a few rules. thanks, - Joel
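For concreteness, a minimal sketch of the low-order-bit tagging idea discussed above could look like the following. This is only an illustration, not part of the series: the helper names are invented, and it assumes (exactly the open question here) that no supported compiler or architecture produces odd-valued function addresses, warning once and falling back to "not lazy" otherwise.

<snip>
/*
 * Hypothetical sketch of marking laziness in the low bit of rhp->func.
 * Not from the patch series; helper names are made up for illustration.
 */
#include <linux/bug.h>
#include <linux/rcupdate.h>

static inline void rcu_head_mark_lazy(struct rcu_head *rhp, rcu_callback_t func)
{
	unsigned long fn = (unsigned long)func;

	/* If the address is odd, warn and fall back to treating it as non-lazy. */
	if (WARN_ON_ONCE(fn & 1UL)) {
		rhp->func = func;
		return;
	}
	rhp->func = (rcu_callback_t)(fn | 1UL);
}

static inline bool rcu_head_is_lazy(struct rcu_head *rhp)
{
	return (unsigned long)rhp->func & 1UL;
}

/* Strip the tag before actually invoking the callback. */
static inline rcu_callback_t rcu_head_real_func(struct rcu_head *rhp)
{
	return (rcu_callback_t)((unsigned long)rhp->func & ~1UL);
}
<snip>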
On Tue, Sep 27, 2022 at 03:25:02PM +0000, Joel Fernandes wrote: > On Tue, Sep 27, 2022 at 07:30:20AM -0700, Paul E. McKenney wrote: > > On Tue, Sep 27, 2022 at 02:22:56PM +0000, Joel Fernandes wrote: > > > On Tue, Sep 27, 2022 at 07:14:03AM -0700, Paul E. McKenney wrote: > > > > On Tue, Sep 27, 2022 at 01:05:41PM +0000, Joel Fernandes wrote: > > > > > On Mon, Sep 26, 2022 at 08:22:46PM -0700, Paul E. McKenney wrote: > > > > > [..] > > > > > > > > > >>> --- a/kernel/workqueue.c > > > > > > > > > >>> +++ b/kernel/workqueue.c > > > > > > > > > >>> @@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork) > > > > > > > > > >>> > > > > > > > > > >>> if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { > > > > > > > > > >>> rwork->wq = wq; > > > > > > > > > >>> - call_rcu(&rwork->rcu, rcu_work_rcufn); > > > > > > > > > >>> + call_rcu_flush(&rwork->rcu, rcu_work_rcufn); > > > > > > > > > >>> return true; > > > > > > > > > >>> } > > > > > > > > > >>> > > > > > > > > > >>> <snip> > > > > > > > > > >>> > > > > > > > > > >>> ? > > > > > > > > > >>> > > > > > > > > > >>> But it does not fully solve my boot-up issue. Will debug tomorrow further. > > > > > > > > > >> > > > > > > > > > >> Ah, but at least its progress, thanks. Could you send me a patch to include > > > > > > > > > >> in the next revision with details of this? > > > > > > > > > >> > > > > > > > > > >>>> Might one more proactive approach be to use Coccinelle to locate such > > > > > > > > > >>>> callback functions? We might not want -all- callbacks that do wakeups > > > > > > > > > >>>> to use call_rcu_flush(), but knowing which are which should speed up > > > > > > > > > >>>> slow-boot debugging by quite a bit. > > > > > > > > > >>>> > > > > > > > > > >>>> Or is there a better way to do this? > > > > > > > > > >>>> > > > > > > > > > >>> I am not sure what Coccinelle is. If we had something automated that measures > > > > > > > > > >>> a boot time and if needed does some profiling it would be good. Otherwise it > > > > > > > > > >>> is a manual debugging mainly, IMHO. > > > > > > > > > >> > > > > > > > > > >> Paul, What about using a default-off kernel CONFIG that splats on all lazy > > > > > > > > > >> call_rcu() callbacks that do a wake up. We could use the trace hooks to do it > > > > > > > > > >> in kernel I think. I can talk to Steve to get ideas on how to do that but I > > > > > > > > > >> think it can be done purely from trace events (we might need a new > > > > > > > > > >> trace_end_invoke_callback to fire after the callback is invoked). Thoughts? > > > > > > > > > > > > > > > > > > > > Could you look for wakeups invoked between trace_rcu_batch_start() and > > > > > > > > > > trace_rcu_batch_end() that are not from interrupt context? This would > > > > > > > > > > of course need to be associated with a task rather than a CPU. > > > > > > > > > > > > > > > > > > Yes this sounds good, but we also need to know if the callbacks are > > > > > > > > > lazy or not since wake-up is ok from a non lazy one. I think I’ll > > > > > > > > > need a table to track that at queuing time. > > > > > > > > > > > > > > > > Agreed. > > > > > > > > > > > > > > > > > > Note that you would need to check for wakeups from interrupt handlers > > > > > > > > > > even with the extra trace_end_invoke_callback(). The window where an > > > > > > > > > > interrupt handler could do a wakeup would be reduced, but not eliminated. > > > > > > > > > > > > > > > > > > True! 
Since this is a debugging option, can we not just disable interrupts across callback invocation? > > > > > > > > > > > > > > > > Not without terminally annoying lockdep, at least for any RCU callbacks > > > > > > > > doing things like spin_lock_bh(). > > > > > > > > > > > > > > > > > > > > > > Sorry if my last email bounced. Looks like my iPhone betrayed me this once ;) > > > > > > > > > > > > > > I was thinking something like this: > > > > > > > 1. Put a flag in rcu_head to mark CBs as lazy. > > > > > > > 2. Add a trace_rcu_invoke_callback_end() trace point. > > > > > > > > > > > > > > Both #1 and #2 can be a debug CONFIG option. #2 can be a tracepoint and not > > > > > > > exposed if needed. > > > > > > > > > > > > > > 3. Put an in-kernel probe on both trace_rcu_invoke_callback_start() and > > > > > > > trace_rcu_invoke_callback_end(). In the start probe, set a per-task flag if > > > > > > > the current CB is lazy. In the end probe, clear it. > > > > > > > > > > > > > > 4. Put an in-kernel probe on trace_rcu_sched_wakeup(). > > > > > > > > > > > > > > Splat in the wake up probe if: > > > > > > > 1. Hard IRQs are on. > > > > > > > 2. The per-cpu flag is set. > > > > > > > > > > > > > > #3 actually does not even need probes if we can directly call the functions > > > > > > > from the rcu_do_batch() function. > > > > > > > > > > > > This is fine for an experiment or a debugging session, but a solution > > > > > > based totally on instrumentation would be better for production use. > > > > > > > > > > Maybe we can borrow the least-significant bit of rhp->func to mark laziness? > > > > > Then it can be production as long as we're ok with the trace_sched_wakeup > > > > > probe. > > > > > > > > Last time I tried this, there were architectures that could have odd-valued > > > > function addresses. Maybe this is no longer the case? > > > > > > Oh ok! If this happens, maybe we can just make it depend on x86-64 assuming > > > x86-64 does not have pointer oddness. We can also add a warning for if the > > > function address is odd before setting the bit. > > > > Let me rephrase this... ;-) > > > > Given that this used to not work and still might not work, let's see > > if we can find some other way to debug this. Unless and until it can > > be demonstrated that there is no supported compiler that will generated > > odd-valued function addresses on any supported architecture. > > > > Plus there was a time that x86 did odd-valued pointer addresses. > > The instruction set is plenty fine with this, so it would have to be a > > compiler and assembly-language convention to avoid it. > > Ok, so then I am not sure how to make it work in production at the moment. I > could track the lazy callbacks in a hashtable but then that's overhead. > > Or, I could focus on trying Vlad's config and figure out what's going on and > keep the auto-debug for later. For one thing, experience with manual debugging might inform later auto-debugging efforts. > On another thought, this is the sort of thing that should be doable via Daniel > Bristot's runtime verification framework, as its a classical "see if these > traces look right" issue which should be teachable to a computer with a few rules. Worth a shot! Failing that, there is always BPF. ;-) Thanx, Paul
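As a rough picture of the "track the lazy callbacks in a hashtable" fallback mentioned above, and of the overhead it implies (an atomic allocation plus a lock round trip per callback), a sketch built on the kernel's generic hashtable might look like this. All names are hypothetical; the record call would sit where a lazy callback is queued, and the lookup just before invocation:

<snip>
#include <linux/hashtable.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct lazy_cb_ent {
	struct hlist_node node;
	struct rcu_head *rhp;
};

static DEFINE_HASHTABLE(lazy_cb_hash, 10);	/* 1024 buckets */
static DEFINE_SPINLOCK(lazy_cb_lock);

/* Record a lazy callback at queuing time. */
static void lazy_cb_record(struct rcu_head *rhp)
{
	struct lazy_cb_ent *e = kmalloc(sizeof(*e), GFP_ATOMIC);
	unsigned long flags;

	if (!e)
		return;
	e->rhp = rhp;
	spin_lock_irqsave(&lazy_cb_lock, flags);
	hash_add(lazy_cb_hash, &e->node, (unsigned long)rhp);
	spin_unlock_irqrestore(&lazy_cb_lock, flags);
}

/* Just before invoking a callback: was it queued lazily? */
static bool lazy_cb_test_and_forget(struct rcu_head *rhp)
{
	struct lazy_cb_ent *e;
	unsigned long flags;
	bool lazy = false;

	spin_lock_irqsave(&lazy_cb_lock, flags);
	hash_for_each_possible(lazy_cb_hash, e, node, (unsigned long)rhp) {
		if (e->rhp == rhp) {
			hash_del(&e->node);
			kfree(e);
			lazy = true;
			break;
		}
	}
	spin_unlock_irqrestore(&lazy_cb_lock, flags);
	return lazy;
}
<snip>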
On Tue, Sep 27, 2022 at 05:13:34PM +0200, Uladzislau Rezki wrote: > On Tue, Sep 27, 2022 at 04:59:44PM +0200, Uladzislau Rezki wrote: > > On Tue, Sep 27, 2022 at 02:30:03PM +0000, Joel Fernandes wrote: > > > On Tue, Sep 27, 2022 at 04:08:18PM +0200, Uladzislau Rezki wrote: > > > > On Mon, Sep 26, 2022 at 08:54:27PM +0000, Joel Fernandes wrote: > > > > > Hi Vlad, > > > > > > > > > > On Mon, Sep 26, 2022 at 09:39:23PM +0200, Uladzislau Rezki wrote: > > > > > [...] > > > > > > > > On my KVM machine the boot time is affected: > > > > > > > > > > > > > > > > <snip> > > > > > > > > [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > > > > > > > [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 > > > > > > > > [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > > > > > > > > [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 > > > > > > > > [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 > > > > > > > > [ 104.115418] process '/usr/bin/fstype' started with executable stack > > > > > > > > [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. > > > > > > > > [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) > > > > > > > > [ 104.340193] systemd[1]: Detected virtualization kvm. > > > > > > > > [ 104.340196] systemd[1]: Detected architecture x86-64. > > > > > > > > [ 104.359032] systemd[1]: Set hostname to <pc638>. > > > > > > > > [ 105.740109] random: crng init done > > > > > > > > [ 105.741267] systemd[1]: Reached target Remote File Systems. > > > > > > > > <snip> > > > > > > > > > > > > > > > > 2 - 11 and second delay is between 32 - 104. So there are still users which must > > > > > > > > be waiting for "RCU" in a sync way. > > > > > > > > > > > > > > I was wondering if you can compare boot logs and see which timestamp does the > > > > > > > slow down start from. That way, we can narrow down the callback. Also another > > > > > > > idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback > > > > > > > ftrace_dump_on_oops" to the boot params, and then manually call > > > > > > > "tracing_off(); panic();" from the code at the first printk that seems off in > > > > > > > your comparison of good vs bad. For example, if "crng init done" timestamp is > > > > > > > off, put the "tracing_off(); panic();" there. Then grab the serial console > > > > > > > output to see what were the last callbacks that was queued/invoked. > > > > > > > > > > Would you be willing to try these steps? Meanwhile I will try on my side as > > > > > well with the .config you sent me in another email. > > > > > > > > > Not exactly those steps. But see below: > > > > > > > > <snip> > > > > [ 2.291319] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection > > > > [ 17.302946] e1000 0000:00:03.0 ens3: renamed from eth0 > > > > <snip> > > > > > > > > 15 seconds delay between two prints. 
I have logged all call_rcu() users > > > > between those two prints: > > > > > > > > <snip> > > > > # tracer: nop > > > > # > > > > # entries-in-buffer/entries-written: 166/166 #P:64 > > > > # > > > > # _-----=> irqs-off/BH-disabled > > > > # / _----=> need-resched > > > > # | / _---=> hardirq/softirq > > > > # || / _--=> preempt-depth > > > > # ||| / _-=> migrate-disable > > > > # |||| / delay > > > > # TASK-PID CPU# ||||| TIMESTAMP FUNCTION > > > > # | | | ||||| | | > > > > systemd-udevd-669 [002] ..... 2.338739: e1000_probe: Intel(R) PRO/1000 Network Connection > > > > systemd-udevd-665 [061] ..... 2.338952: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 2.338962: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > systemd-udevd-665 [061] ..... 2.338965: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-645 [053] ..... 2.338968: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > systemd-udevd-665 [061] ..... 2.338987: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > systemd-udevd-645 [053] ..... 2.338989: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-645 [053] ..... 2.338999: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > systemd-udevd-645 [053] ..... 2.339002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-645 [053] ..... 2.339024: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > kworker/0:3-546 [000] d..1. 6.329516: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70effe40a0 > > > > rcuop/0-17 [000] b.... 6.337320: __call_rcu_common: -> 0x0: exit_creds+0x63/0x70 <- 0x0 > > > > kworker/38:1-744 [038] d..1. 6.841479: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f97e40a0 > > > > <...>-739 [035] d..1. 6.841479: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f8be40a0 > > > > <...>-732 [021] d..1. 6.841486: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f53e40a0 > > > > kworker/36:1-740 [036] d..1. 6.841487: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f8fe40a0 > > > > rcuop/21-170 [023] b.... 6.849276: __call_rcu_common: -> 0x0: exit_creds+0x63/0x70 <- 0x0 > > > > rcuop/38-291 [052] b.... 6.849950: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/38-291 [052] b.... 6.849957: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > kworker/5:1-712 [005] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f13e40a0 > > > > kworker/19:1-727 [019] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f4be40a0 > > > > <...>-719 [007] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f1be40a0 > > > > kworker/13:1-721 [013] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f33e40a0 > > > > kworker/52:1-756 [052] d..1. 7.097395: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fcfe40a0 > > > > kworker/29:1-611 [029] d..1. 7.097395: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f73e40a0 > > > > <...>-754 [049] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fc3e40a0 > > > > kworker/12:1-726 [012] d..1. 
7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f2fe40a0 > > > > kworker/53:1-710 [053] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fd3e40a0 > > > > <...>-762 [061] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70ff3e40a0 > > > > <...>-757 [054] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fd7e40a0 > > > > kworker/25:1-537 [025] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f63e40a0 > > > > <...>-714 [004] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f0fe40a0 > > > > <...>-749 [044] d..1. 7.097413: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fafe40a0 > > > > kworker/51:1-755 [051] d..1. 7.097413: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fcbe40a0 > > > > <...>-764 [063] d..1. 7.097415: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70ffbe40a0 > > > > <...>-753 [045] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fb3e40a0 > > > > kworker/43:1-748 [043] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fabe40a0 > > > > kworker/41:1-747 [041] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fa3e40a0 > > > > kworker/57:1-760 [057] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fe3e40a0 > > > > <...>-720 [008] d..1. 7.097418: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f1fe40a0 > > > > kworker/58:1-759 [058] d..1. 7.097421: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fe7e40a0 > > > > kworker/16:1-728 [016] d..1. 7.097424: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f3fe40a0 > > > > <...>-722 [010] d..1. 7.097427: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f27e40a0 > > > > kworker/22:1-733 [022] d..1. 7.097432: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f57e40a0 > > > > <...>-731 [026] d..1. 7.097432: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f67e40a0 > > > > <...>-752 [048] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fbfe40a0 > > > > kworker/18:0-147 [018] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f47e40a0 > > > > kworker/39:1-745 [039] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f9be40a0 > > > > <...>-716 [003] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f0be40a0 > > > > <...>-703 [050] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fc7e40a0 > > > > kworker/42:1-746 [042] d..1. 7.097444: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fa7e40a0 > > > > rcuop/13-113 [013] b.... 7.105592: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/13-113 [013] b.... 7.105595: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/10-92 [040] b.... 7.105608: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/10-92 [040] b.... 7.105610: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/16-135 [023] b.... 7.105613: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/8-78 [039] b.... 7.105636: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/8-78 [039] b.... 
7.105640: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/12-106 [040] b.... 7.105651: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/12-106 [040] b.... 7.105652: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/19-156 [000] b.... 7.105727: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/19-156 [000] b.... 7.105730: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/5-56 [058] b.... 7.105808: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/5-56 [058] b.... 7.105814: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/20-163 [023] b.... 17.345648: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/20-163 [023] b.... 17.345655: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/14-120 [013] b.... 17.345675: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/14-120 [013] b.... 17.345681: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/6-63 [013] b.... 17.345714: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/6-63 [013] b.... 17.345715: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/9-85 [000] b.... 17.345753: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/9-85 [000] b.... 17.345758: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/17-142 [000] b.... 17.345775: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/17-142 [000] b.... 17.345776: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/17-142 [000] b.... 17.345777: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/11-99 [000] b.... 17.345810: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/11-99 [000] b.... 17.345811: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/15-127 [013] b.... 17.345832: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/15-127 [013] b.... 17.345834: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/1-28 [000] b.... 17.345834: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/1-28 [000] b.... 17.345835: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/15-127 [013] b.... 17.345835: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > rcuop/15-127 [013] b.... 17.345837: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 > > > > systemd-udevd-633 [035] ..... 17.346591: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > systemd-udevd-633 [035] ..... 17.346609: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > systemd-udevd-633 [035] ..... 17.346659: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > systemd-udevd-633 [035] ..... 17.346666: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > systemd-udevd-669 [002] ..... 17.347573: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > kworker/2:2-769 [002] ..... 17.347659: __call_rcu_common: -> 0x0: __wait_rcu_gp+0xff/0x120 <- 0x0 > > > > systemd-udevd-675 [012] ..... 17.347981: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-675 [012] ..... 17.348002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-675 [012] ..... 
17.348037: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348098: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348117: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > systemd-udevd-665 [061] ..... 17.348120: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348156: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348166: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348176: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-642 [050] ..... 17.348179: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348186: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348197: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348200: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348231: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348240: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348250: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348259: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348262: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348305: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348317: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348332: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348336: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348394: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348403: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348406: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-645 [053] ..... 17.348503: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-645 [053] ..... 17.348531: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > systemd-udevd-645 [053] ..... 17.348535: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348536: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-645 [053] ..... 17.348563: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > systemd-udevd-665 [061] ..... 17.348575: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 
17.348628: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348704: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348828: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348884: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348904: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348954: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348983: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.348993: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349014: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349024: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349026: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349119: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349182: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349243: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349430: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349462: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349472: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349483: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349486: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349583: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349632: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349666: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349699: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-645 [053] ..... 17.349727: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349733: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-645 [053] ..... 17.349739: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > systemd-udevd-645 [053] ..... 17.349742: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349765: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-645 [053] ..... 
17.349766: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > systemd-udevd-642 [050] ..... 17.349780: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > systemd-udevd-665 [061] ..... 17.349800: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-642 [050] ..... 17.349815: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-642 [050] ..... 17.349829: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > systemd-udevd-642 [050] ..... 17.349832: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349834: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-675 [012] ..... 17.349835: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-675 [012] ..... 17.349853: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-642 [050] ..... 17.349861: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > systemd-udevd-675 [012] ..... 17.349873: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.349879: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.350007: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.350011: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.350080: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.350175: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 > > > > systemd-udevd-665 [061] ..... 17.350362: dev_change_name: --> renamed from eth0 > > > > <snip> > > > > > > > > First delay: > > > > > > > > <snip> > > > > systemd-udevd-645 [053] ..... 2.339024: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 > > > > kworker/0:3-546 [000] d..1. 6.329516: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70effe40a0 > > > > <snip> > > > > > > > > __dentry_kill() function and after 4 seconds there is another one queue_rcu_work(). > > > > I have checked the __dentry_kill() if it can do any sync talk with RCU but from the > > > > first glance i do not see anything critical. But more attention is required. > > > > > > Can you log rcu_barrier() as well? It could be that the print is just a side > > > effect of something else that is not being printed. > > > > > It has nothing to do with rcu_barrier() in my case. Also i have checked > > the synchronize_rcu() it also works as expected, i.e. it is not a > > blocking reason. > > > > Have you tried my config? > > > > -- > > Uladzislau Rezki > OK. Seems one place i have spot: > > <snip> > [ 7.074847] calling init_sr+0x0/0x1000 [sr_mod] @ 668 > [ 22.422808] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray > [ 22.422815] cdrom: Uniform CD-ROM driver Revision: 3.20 > [ 32.664590] sr 1:0:0:0: Attached scsi CD-ROM sr0 > [ 32.664642] initcall init_sr+0x0/0x1000 [sr_mod] returned 0 after 25589786 usecs > <snip> > > -- > Uladzislau Rezki OK. Found the boot up issue. 
In my case I had a delay of 120 seconds:

<snip>
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 448748e3fba5..a56cfd612e3a 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -312,7 +312,7 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd)
 	 * Ensure that all tasks observe the host state change before the
 	 * host_failed change.
 	 */
-	call_rcu(&scmd->rcu, scsi_eh_inc_host_failed);
+	call_rcu_flush(&scmd->rcu, scsi_eh_inc_host_failed);
 }
 
 /**
<snip>

After this change the boot-up time settles back to the normal 4 seconds.

--
Uladzislau Rezki
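The pattern behind this one-liner is a callback that something synchronously waits on: scsi_eh_inc_host_failed() ends up waking the SCSI error handler, and the sr0 probe during boot ends up waiting behind it, so a lazily batched callback turns into a multi-second stall. Below is a contrived, hypothetical illustration of that rule of thumb; struct foo and its helpers are made up, and only call_rcu_flush() comes from this series:

<snip>
#include <linux/completion.h>
#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	struct rcu_head rcu;
	struct completion released;
};

static void foo_release_rcu(struct rcu_head *rhp)
{
	struct foo *f = container_of(rhp, struct foo, rcu);

	/* A waiter is blocked on this, so the callback must not be deferred. */
	complete(&f->released);
}

static void foo_teardown(struct foo *f)
{
	init_completion(&f->released);
	/* Waited-on wakeup path: use the flush variant, not lazy call_rcu(). */
	call_rcu_flush(&f->rcu, foo_release_rcu);
	wait_for_completion(&f->released);
	kfree(f);
}
<snip>

Fire-and-forget reclamation, by contrast, can stay on plain call_rcu() and benefit from the batching.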
> On Sep 27, 2022, at 5:31 PM, Uladzislau Rezki <urezki@gmail.com> wrote: > > On Tue, Sep 27, 2022 at 05:13:34PM +0200, Uladzislau Rezki wrote: >>> On Tue, Sep 27, 2022 at 04:59:44PM +0200, Uladzislau Rezki wrote: >>> On Tue, Sep 27, 2022 at 02:30:03PM +0000, Joel Fernandes wrote: >>>> On Tue, Sep 27, 2022 at 04:08:18PM +0200, Uladzislau Rezki wrote: >>>>> On Mon, Sep 26, 2022 at 08:54:27PM +0000, Joel Fernandes wrote: >>>>>> Hi Vlad, >>>>>> >>>>>> On Mon, Sep 26, 2022 at 09:39:23PM +0200, Uladzislau Rezki wrote: >>>>>> [...] >>>>>>>>> On my KVM machine the boot time is affected: >>>>>>>>> >>>>>>>>> <snip> >>>>>>>>> [ 2.273406] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection >>>>>>>>> [ 11.945283] e1000 0000:00:03.0 ens3: renamed from eth0 >>>>>>>>> [ 22.165198] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray >>>>>>>>> [ 22.165206] cdrom: Uniform CD-ROM driver Revision: 3.20 >>>>>>>>> [ 32.406981] sr 1:0:0:0: Attached scsi CD-ROM sr0 >>>>>>>>> [ 104.115418] process '/usr/bin/fstype' started with executable stack >>>>>>>>> [ 104.170142] EXT4-fs (sda1): mounted filesystem with ordered data mode. Quota mode: none. >>>>>>>>> [ 104.340125] systemd[1]: systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) >>>>>>>>> [ 104.340193] systemd[1]: Detected virtualization kvm. >>>>>>>>> [ 104.340196] systemd[1]: Detected architecture x86-64. >>>>>>>>> [ 104.359032] systemd[1]: Set hostname to <pc638>. >>>>>>>>> [ 105.740109] random: crng init done >>>>>>>>> [ 105.741267] systemd[1]: Reached target Remote File Systems. >>>>>>>>> <snip> >>>>>>>>> >>>>>>>>> 2 - 11 and second delay is between 32 - 104. So there are still users which must >>>>>>>>> be waiting for "RCU" in a sync way. >>>>>>>> >>>>>>>> I was wondering if you can compare boot logs and see which timestamp does the >>>>>>>> slow down start from. That way, we can narrow down the callback. Also another >>>>>>>> idea is, add "trace_event=rcu:rcu_callback,rcu:rcu_invoke_callback >>>>>>>> ftrace_dump_on_oops" to the boot params, and then manually call >>>>>>>> "tracing_off(); panic();" from the code at the first printk that seems off in >>>>>>>> your comparison of good vs bad. For example, if "crng init done" timestamp is >>>>>>>> off, put the "tracing_off(); panic();" there. Then grab the serial console >>>>>>>> output to see what were the last callbacks that was queued/invoked. >>>>>> >>>>>> Would you be willing to try these steps? Meanwhile I will try on my side as >>>>>> well with the .config you sent me in another email. >>>>>> >>>>> Not exactly those steps. But see below: >>>>> >>>>> <snip> >>>>> [ 2.291319] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection >>>>> [ 17.302946] e1000 0000:00:03.0 ens3: renamed from eth0 >>>>> <snip> >>>>> >>>>> 15 seconds delay between two prints. I have logged all call_rcu() users >>>>> between those two prints: >>>>> >>>>> <snip> >>>>> # tracer: nop >>>>> # >>>>> # entries-in-buffer/entries-written: 166/166 #P:64 >>>>> # >>>>> # _-----=> irqs-off/BH-disabled >>>>> # / _----=> need-resched >>>>> # | / _---=> hardirq/softirq >>>>> # || / _--=> preempt-depth >>>>> # ||| / _-=> migrate-disable >>>>> # |||| / delay >>>>> # TASK-PID CPU# ||||| TIMESTAMP FUNCTION >>>>> # | | | ||||| | | >>>>> systemd-udevd-669 [002] ..... 2.338739: e1000_probe: Intel(R) PRO/1000 Network Connection >>>>> systemd-udevd-665 [061] ..... 
2.338952: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 2.338962: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> systemd-udevd-665 [061] ..... 2.338965: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-645 [053] ..... 2.338968: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> systemd-udevd-665 [061] ..... 2.338987: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> systemd-udevd-645 [053] ..... 2.338989: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-645 [053] ..... 2.338999: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> systemd-udevd-645 [053] ..... 2.339002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-645 [053] ..... 2.339024: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> kworker/0:3-546 [000] d..1. 6.329516: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70effe40a0 >>>>> rcuop/0-17 [000] b.... 6.337320: __call_rcu_common: -> 0x0: exit_creds+0x63/0x70 <- 0x0 >>>>> kworker/38:1-744 [038] d..1. 6.841479: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f97e40a0 >>>>> <...>-739 [035] d..1. 6.841479: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f8be40a0 >>>>> <...>-732 [021] d..1. 6.841486: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f53e40a0 >>>>> kworker/36:1-740 [036] d..1. 6.841487: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f8fe40a0 >>>>> rcuop/21-170 [023] b.... 6.849276: __call_rcu_common: -> 0x0: exit_creds+0x63/0x70 <- 0x0 >>>>> rcuop/38-291 [052] b.... 6.849950: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/38-291 [052] b.... 6.849957: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> kworker/5:1-712 [005] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f13e40a0 >>>>> kworker/19:1-727 [019] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f4be40a0 >>>>> <...>-719 [007] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f1be40a0 >>>>> kworker/13:1-721 [013] d..1. 7.097392: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f33e40a0 >>>>> kworker/52:1-756 [052] d..1. 7.097395: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fcfe40a0 >>>>> kworker/29:1-611 [029] d..1. 7.097395: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f73e40a0 >>>>> <...>-754 [049] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fc3e40a0 >>>>> kworker/12:1-726 [012] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f2fe40a0 >>>>> kworker/53:1-710 [053] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fd3e40a0 >>>>> <...>-762 [061] d..1. 7.097405: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70ff3e40a0 >>>>> <...>-757 [054] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fd7e40a0 >>>>> kworker/25:1-537 [025] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f63e40a0 >>>>> <...>-714 [004] d..1. 7.097408: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f0fe40a0 >>>>> <...>-749 [044] d..1. 
7.097413: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fafe40a0 >>>>> kworker/51:1-755 [051] d..1. 7.097413: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fcbe40a0 >>>>> <...>-764 [063] d..1. 7.097415: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70ffbe40a0 >>>>> <...>-753 [045] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fb3e40a0 >>>>> kworker/43:1-748 [043] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fabe40a0 >>>>> kworker/41:1-747 [041] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fa3e40a0 >>>>> kworker/57:1-760 [057] d..1. 7.097416: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fe3e40a0 >>>>> <...>-720 [008] d..1. 7.097418: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f1fe40a0 >>>>> kworker/58:1-759 [058] d..1. 7.097421: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fe7e40a0 >>>>> kworker/16:1-728 [016] d..1. 7.097424: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f3fe40a0 >>>>> <...>-722 [010] d..1. 7.097427: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f27e40a0 >>>>> kworker/22:1-733 [022] d..1. 7.097432: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f57e40a0 >>>>> <...>-731 [026] d..1. 7.097432: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f67e40a0 >>>>> <...>-752 [048] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fbfe40a0 >>>>> kworker/18:0-147 [018] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f47e40a0 >>>>> kworker/39:1-745 [039] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f9be40a0 >>>>> <...>-716 [003] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70f0be40a0 >>>>> <...>-703 [050] d..1. 7.097437: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fc7e40a0 >>>>> kworker/42:1-746 [042] d..1. 7.097444: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70fa7e40a0 >>>>> rcuop/13-113 [013] b.... 7.105592: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/13-113 [013] b.... 7.105595: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/10-92 [040] b.... 7.105608: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/10-92 [040] b.... 7.105610: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/16-135 [023] b.... 7.105613: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/8-78 [039] b.... 7.105636: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/8-78 [039] b.... 7.105640: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/12-106 [040] b.... 7.105651: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/12-106 [040] b.... 7.105652: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/19-156 [000] b.... 7.105727: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/19-156 [000] b.... 7.105730: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/5-56 [058] b.... 7.105808: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/5-56 [058] b.... 7.105814: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/20-163 [023] b.... 
17.345648: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/20-163 [023] b.... 17.345655: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/14-120 [013] b.... 17.345675: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/14-120 [013] b.... 17.345681: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/6-63 [013] b.... 17.345714: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/6-63 [013] b.... 17.345715: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/9-85 [000] b.... 17.345753: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/9-85 [000] b.... 17.345758: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/17-142 [000] b.... 17.345775: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/17-142 [000] b.... 17.345776: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/17-142 [000] b.... 17.345777: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/11-99 [000] b.... 17.345810: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/11-99 [000] b.... 17.345811: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/15-127 [013] b.... 17.345832: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/15-127 [013] b.... 17.345834: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/1-28 [000] b.... 17.345834: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/1-28 [000] b.... 17.345835: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/15-127 [013] b.... 17.345835: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> rcuop/15-127 [013] b.... 17.345837: __call_rcu_common: -> 0x0: file_free_rcu+0x32/0x50 <- 0x0 >>>>> systemd-udevd-633 [035] ..... 17.346591: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> systemd-udevd-633 [035] ..... 17.346609: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> systemd-udevd-633 [035] ..... 17.346659: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> systemd-udevd-633 [035] ..... 17.346666: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> systemd-udevd-669 [002] ..... 17.347573: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> kworker/2:2-769 [002] ..... 17.347659: __call_rcu_common: -> 0x0: __wait_rcu_gp+0xff/0x120 <- 0x0 >>>>> systemd-udevd-675 [012] ..... 17.347981: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-675 [012] ..... 17.348002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-675 [012] ..... 17.348037: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348098: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348117: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> systemd-udevd-665 [061] ..... 17.348120: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348156: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348166: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 
17.348176: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-642 [050] ..... 17.348179: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348186: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348197: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348200: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348231: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348240: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348250: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348259: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348262: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348305: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348317: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348332: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348336: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348394: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348403: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348406: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-645 [053] ..... 17.348503: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-645 [053] ..... 17.348531: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> systemd-udevd-645 [053] ..... 17.348535: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348536: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-645 [053] ..... 17.348563: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> systemd-udevd-665 [061] ..... 17.348575: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348628: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348704: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348828: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348884: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348904: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348954: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.348983: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 
17.348993: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349002: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349014: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349024: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349026: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349119: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349182: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349243: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349430: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349462: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349472: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349483: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349486: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349583: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349632: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349666: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349699: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-645 [053] ..... 17.349727: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349733: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-645 [053] ..... 17.349739: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> systemd-udevd-645 [053] ..... 17.349742: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349765: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-645 [053] ..... 17.349766: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> systemd-udevd-642 [050] ..... 17.349780: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> systemd-udevd-665 [061] ..... 17.349800: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-642 [050] ..... 17.349815: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-642 [050] ..... 17.349829: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> systemd-udevd-642 [050] ..... 17.349832: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349834: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-675 [012] ..... 17.349835: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-675 [012] ..... 
17.349853: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-642 [050] ..... 17.349861: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> systemd-udevd-675 [012] ..... 17.349873: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.349879: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.350007: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.350011: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.350080: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.350175: __call_rcu_common: -> ____fput+0x0/0x10: task_work_run+0x5c/0x90 <- 0x0 >>>>> systemd-udevd-665 [061] ..... 17.350362: dev_change_name: --> renamed from eth0 >>>>> <snip> >>>>> >>>>> First delay: >>>>> >>>>> <snip> >>>>> systemd-udevd-645 [053] ..... 2.339024: __call_rcu_common: -> 0x0: __dentry_kill+0x140/0x180 <- 0x2 >>>>> kworker/0:3-546 [000] d..1. 6.329516: __call_rcu_common: -> 0x0: queue_rcu_work+0x2b/0x40 <- 0xffff8c70effe40a0 >>>>> <snip> >>>>> >>>>> __dentry_kill() function and after 4 seconds there is another one queue_rcu_work(). >>>>> I have checked the __dentry_kill() if it can do any sync talk with RCU but from the >>>>> first glance i do not see anything critical. But more attention is required. >>>> >>>> Can you log rcu_barrier() as well? It could be that the print is just a side >>>> effect of something else that is not being printed. >>>> >>> It has nothing to do with rcu_barrier() in my case. Also i have checked >>> the synchronize_rcu() it also works as expected, i.e. it is not a >>> blocking reason. >>> >>> Have you tried my config? >>> >>> -- >>> Uladzislau Rezki >> OK. Seems one place i have spot: >> >> <snip> >> [ 7.074847] calling init_sr+0x0/0x1000 [sr_mod] @ 668 >> [ 22.422808] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray >> [ 22.422815] cdrom: Uniform CD-ROM driver Revision: 3.20 >> [ 32.664590] sr 1:0:0:0: Attached scsi CD-ROM sr0 >> [ 32.664642] initcall init_sr+0x0/0x1000 [sr_mod] returned 0 after 25589786 usecs >> <snip> >> >> -- >> Uladzislau Rezki > > OK. Found the boot up issue. In my case i had 120 seconds delay: Wow, nice work. > <snip> > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > index 448748e3fba5..a56cfd612e3a 100644 > --- a/drivers/scsi/scsi_error.c > +++ b/drivers/scsi/scsi_error.c > @@ -312,7 +312,7 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd) > * Ensure that all tasks observe the host state change before the > * host_failed change. > */ > - call_rcu(&scmd->rcu, scsi_eh_inc_host_failed); > + call_rcu_flush(&scmd->rcu, scsi_eh_inc_host_failed); Great! Thanks. I’ll include this and the other one you converted in the next revision. Thanks, - Joel > } > > /** > <snip> > > After this change the boot-up time settles back to normal 4 seconds. > > -- > Uladzislau Rezki
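The failure mode behind this fix is a general pattern: a path queues a callback with call_rcu() and then, directly or indirectly, waits for that callback's side effect before it can make progress. With lazy batching the callback can sit for many seconds, so such latency-sensitive paths want call_rcu_flush(). Below is a minimal, hypothetical sketch of that pattern; the foo_* names and the completion-based wait are invented for illustration and are not the SCSI error-handling code.

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/completion.h>

struct foo_cmd {
	struct rcu_head rcu;
	struct completion retired;	/* another thread waits on this */
};

static void foo_cmd_retire(struct rcu_head *rhp)
{
	struct foo_cmd *cmd = container_of(rhp, struct foo_cmd, rcu);

	complete(&cmd->retired);	/* waiter is blocked until this runs */
}

static void foo_retire_and_wait(struct foo_cmd *cmd)
{
	init_completion(&cmd->retired);
	/*
	 * Forward progress depends on the callback running soon, so flush
	 * the lazy/bypass callbacks instead of letting this one linger.
	 */
	call_rcu_flush(&cmd->rcu, foo_cmd_retire);
	wait_for_completion(&cmd->retired);
}

The scsi_eh_scmd_add() case above is presumably analogous in effect: until scsi_eh_inc_host_failed() runs, error handling (and hence the stalled sr probe) cannot proceed, so that callback must not be deferred lazily.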
> On Sep 27, 2022, at 6:05 PM, Joel Fernandes <joel@joelfernandes.org> wrote: > > [ ... quoted boot log, call_rcu() trace and init_sr delay discussion trimmed; reproduced verbatim in the message above ... ] >> >> OK. Found the boot up issue. In my case i had 120 seconds delay: > > Wow, nice work.
> >> <snip> >> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c >> index 448748e3fba5..a56cfd612e3a 100644 >> --- a/drivers/scsi/scsi_error.c >> +++ b/drivers/scsi/scsi_error.c >> @@ -312,7 +312,7 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd) >> * Ensure that all tasks observe the host state change before the >> * host_failed change. >> */ >> - call_rcu(&scmd->rcu, scsi_eh_inc_host_failed); >> + call_rcu_flush(&scmd->rcu, scsi_eh_inc_host_failed); > > Great! Thanks. I’ll include this and the other one you converted in the next revision. By the way, any chance you could check android as well, just to rule out any trouble markers? ChromeOS and your Linux distro are doing well on boot so that’s a good sign. (Also let’s start trimming emails before Steven starts sending out nastygrams ;-)). Thanks, - Joel > > Thanks, > > - Joel > >> } >> >> /** >> <snip> >> >> After this change the boot-up time settles back to normal 4 seconds. >> >> -- >> Uladzislau Rezki
> >> > >> OK. Found the boot up issue. In my case i had 120 seconds delay: > > > > Wow, nice work. > > > >> <snip> > >> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > >> index 448748e3fba5..a56cfd612e3a 100644 > >> --- a/drivers/scsi/scsi_error.c > >> +++ b/drivers/scsi/scsi_error.c > >> @@ -312,7 +312,7 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd) > >> * Ensure that all tasks observe the host state change before the > >> * host_failed change. > >> */ > >> - call_rcu(&scmd->rcu, scsi_eh_inc_host_failed); > >> + call_rcu_flush(&scmd->rcu, scsi_eh_inc_host_failed); > > > > Great! Thanks. I’ll include this and the other one you converted in the next revision. > > By the way, any chance you could check android as well, just to rule out any trouble markers? ChromeOS and your Linux distro are doing well on boot so that’s a good sign. > I will check v6 on Android. I will get back shortly. > (Also let’s start trimming emails before Steven starts sending out nastygrams ;-)). > Done :) -- Uladzislau Rezki
On Mon, Sep 26, 2022 at 04:53:51PM -0700, Paul E. McKenney wrote: > On Mon, Sep 26, 2022 at 07:33:17PM -0400, Joel Fernandes wrote: > > > > > > > On Sep 26, 2022, at 6:37 PM, Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > On Mon, Sep 26, 2022 at 09:07:12PM +0000, Joel Fernandes wrote: > > >> Hi Paul, > > >> > > >> On Mon, Sep 26, 2022 at 10:42:40AM -0700, Paul E. McKenney wrote: > > >> [..] > > >>>>>>>> + WRITE_ONCE(rdp->lazy_len, 0); > > >>>>>>>> + } else { > > >>>>>>>> + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); > > >>>>>>>> + WRITE_ONCE(rdp->lazy_len, 0); > > >>>>>>> > > >>>>>>> This WRITE_ONCE() can be dropped out of the "if" statement, correct? > > >>>>>> > > >>>>>> Yes will update. > > >>>>> > > >>>>> Thank you! > > >>>>> > > >>>>>>> If so, this could be an "if" statement with two statements in its "then" > > >>>>>>> clause, no "else" clause, and two statements following the "if" statement. > > >>>>>> > > >>>>>> I don’t think we can get rid of the else part but I’ll see what it looks like. > > >>>>> > > >>>>> In the function header, s/rhp/rhp_in/, then: > > >>>>> > > >>>>> struct rcu_head *rhp = rhp_in; > > >>>>> > > >>>>> And then: > > >>>>> > > >>>>> if (lazy && rhp) { > > >>>>> rcu_cblist_enqueue(&rdp->nocb_bypass, rhp); > > >>>>> rhp = NULL; > > >>>> > > >>>> This enqueues on to the bypass list, where as if lazy && rhp, I want to queue > > >>>> the new rhp on to the main cblist. So the pseudo code in my patch is: > > >>>> > > >>>> if (lazy and rhp) then > > >>>> 1. flush bypass CBs on to main list. > > >>>> 2. queue new CB on to main list. > > >>> > > >>> And the difference is here, correct? I enqueue to the bypass list, > > >>> which is then flushed (in order) to the main list. In contrast, you > > >>> flush the bypass list, then enqueue to the main list. Either way, > > >>> the callback referenced by rhp ends up at the end of ->cblist. > > >>> > > >>> Or am I on the wrong branch of this "if" statement? > > >> > > >> But we have to flush first, and then queue the new one. Otherwise wouldn't > > >> the callbacks be invoked out of order? Or did I miss something? > > > > > > I don't think so... > > > > > > We want the new callback to be last, right? One way to do that is to > > > flush the bypass, then queue the new callback onto ->cblist. Another way > > > to do that is to enqueue the new callback onto the end of the bypass, > > > then flush the bypass. Why wouldn't these result in the same order? > > > > Yes you are right, sorry. I was fixated on the main list. Both your snippet and my patch will be equivalent then. However I find your snippet a bit confusing, as in it is not immediately obvious - why would we queue something on to a list, if we were about to flush it. But any way, it does make it a clever piece of code in some sense and I am ok with doing it this way ;-) > > As long as the ->cblist.len comes out with the right value. ;-) The ->cblist.len's value is not effected by your suggested change, because the bypass list's length is already accounted into the ->cblist.len, and for the new rhp, after rcu_nocb_do_flush_bypass() is called, it either ends up in the bypass list (if it is !lazy) or on the main cblist (if its lazy). So everything just works. Below is the change. If its OK with you though, I will put it in a separate commit just to be extra safe, since the code before it was well tested and I am still testing it. Thanks. 
---8<----------------------- From: "Joel Fernandes (Google)" <joel@joelfernandes.org> Subject: [PATCH] rcu: Refactor code a bit in rcu_nocb_do_flush_bypass() This consolidates the code a bit and makes it cleaner. Functionally it is the same. Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> --- kernel/rcu/tree_nocb.h | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h index d69d058a78f9..1fc704d102a3 100644 --- a/kernel/rcu/tree_nocb.h +++ b/kernel/rcu/tree_nocb.h @@ -327,10 +327,11 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype, * * Note that this function always returns true if rhp is NULL. */ -static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, +static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp_in, unsigned long j, unsigned long flush_flags) { struct rcu_cblist rcl; + struct rcu_head *rhp = rhp_in; bool lazy = flush_flags & FLUSH_BP_LAZY; WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp)); @@ -347,16 +348,15 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp, /* * If the new CB requested was a lazy one, queue it onto the main * ->cblist so that we can take advantage of the grace-period that will - * happen regardless. + * happen regardless. But queue it onto the bypass list first so that + * the lazy CB is ordered with the existing CBs in the bypass list. */ if (lazy && rhp) { - rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL); - rcu_cblist_enqueue(&rcl, rhp); - WRITE_ONCE(rdp->lazy_len, 0); - } else { - rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); - WRITE_ONCE(rdp->lazy_len, 0); + rcu_cblist_enqueue(&rdp->nocb_bypass, rhp); + rhp = NULL; } + rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp); + WRITE_ONCE(rdp->lazy_len, 0); rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl); WRITE_ONCE(rdp->nocb_bypass_first, j);
On Mon, Oct 03, 2022 at 07:33:24PM +0000, Joel Fernandes wrote: [ ... earlier discussion trimmed; see the message above ... ] > The ->cblist.len's value is not effected by your suggested change, because > the bypass list's length is already accounted into the ->cblist.len, and for > the new rhp, after rcu_nocb_do_flush_bypass() is called, it either ends up in > the bypass list (if it is !lazy) or on the main cblist (if its lazy). So > everything just works. Below is the change. If its OK with you though, I will > put it in a separate commit just to be extra safe, since the code before it > was well tested and I am still testing it. Having this as a separate simplification commit is fine by me. And thank you for digging into this! Thanx, Paul > [ ... quoted refactoring patch trimmed; see the message above ... ]
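The point Paul and Joel converge on above — that "enqueue the new callback at the bypass tail, then flush" and "flush the bypass, then enqueue onto ->cblist" produce the same final ordering — can be checked with a tiny user-space model. This is a toy illustration only; the list helpers below are invented for this sketch and are not the kernel's rcu_cblist API.

#include <stdio.h>

struct cb { int id; struct cb *next; };

/* Append a callback to the tail of a singly linked list. */
static void enqueue(struct cb **head, struct cb *cbp)
{
	while (*head)
		head = &(*head)->next;
	cbp->next = NULL;
	*head = cbp;
}

/*
 * Move every callback from @from to the tail of @to, then optionally
 * append one extra callback (mirrors the flush-and-enqueue helper).
 */
static void flush_enqueue(struct cb **to, struct cb **from, struct cb *extra)
{
	while (*from) {
		struct cb *next = (*from)->next;

		enqueue(to, *from);
		*from = next;
	}
	if (extra)
		enqueue(to, extra);
}

int main(void)
{
	struct cb b1 = { .id = 1 }, b2 = { .id = 2 }, rhp = { .id = 3 };
	struct cb *bypass = NULL, *cblist = NULL;

	enqueue(&bypass, &b1);
	enqueue(&bypass, &b2);

	/* Way A: enqueue rhp onto the bypass tail, then flush it. */
	enqueue(&bypass, &rhp);
	flush_enqueue(&cblist, &bypass, NULL);

	/*
	 * Way B would be: flush_enqueue(&cblist, &bypass, &rhp); that is,
	 * flush first and pass rhp as the trailing callback. Either way
	 * cblist ends up ordered 1, 2, 3.
	 */
	for (struct cb *c = cblist; c; c = c->next)
		printf("%d ", c->id);
	printf("\n");
	return 0;
}

Either sequence leaves the pre-existing bypass callbacks ahead of the new one, which is why the refactored rcu_nocb_do_flush_bypass() can take the simpler enqueue-then-flush form without reordering callbacks.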
> > >> > > >> OK. Found the boot up issue. In my case i had 120 seconds delay: > > > > > > Wow, nice work. > > > > > >> <snip> > > >> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > > >> index 448748e3fba5..a56cfd612e3a 100644 > > >> --- a/drivers/scsi/scsi_error.c > > >> +++ b/drivers/scsi/scsi_error.c > > >> @@ -312,7 +312,7 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd) > > >> * Ensure that all tasks observe the host state change before the > > >> * host_failed change. > > >> */ > > >> - call_rcu(&scmd->rcu, scsi_eh_inc_host_failed); > > >> + call_rcu_flush(&scmd->rcu, scsi_eh_inc_host_failed); > > > > > > Great! Thanks. I’ll include this and the other one you converted in the next revision. > > > > By the way, any chance you could check android as well, just to rule out any trouble markers? ChromeOS and your Linux distro are doing well on boot so that’s a good sign. > > > I will check v6 on Android. I will get back shortly. > OK. Works and boots well. I do not see any issues with Android so far. -- Uladzislau Rezki
Hi Vlad, On Tue, Oct 4, 2022 at 7:35 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > >> > > > >> OK. Found the boot up issue. In my case i had 120 seconds delay: > > > > > > > > Wow, nice work. > > > > > > > >> <snip> > > > >> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > > > >> index 448748e3fba5..a56cfd612e3a 100644 > > > >> --- a/drivers/scsi/scsi_error.c > > > >> +++ b/drivers/scsi/scsi_error.c > > > >> @@ -312,7 +312,7 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd) > > > >> * Ensure that all tasks observe the host state change before the > > > >> * host_failed change. > > > >> */ > > > >> - call_rcu(&scmd->rcu, scsi_eh_inc_host_failed); > > > >> + call_rcu_flush(&scmd->rcu, scsi_eh_inc_host_failed); > > > > > > > > Great! Thanks. I’ll include this and the other one you converted in the next revision. > > > > > > By the way, any chance you could check android as well, just to rule out any trouble markers? ChromeOS and your Linux distro are doing well on boot so that’s a good sign. > > > > > I will check v6 on Android. I will get back shortly. > > > OK. Works and boots well. I do not see any issues with Android so far. > That's great news and thank you so much for testing Android and all your efforts on this work! - Joel
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 08605ce7379d..40ae36904825 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void)
 #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
+#ifdef CONFIG_RCU_LAZY
+void call_rcu_flush(struct rcu_head *head, rcu_callback_t func);
+#else
+static inline void call_rcu_flush(struct rcu_head *head,
+		rcu_callback_t func) { call_rcu(head, func); }
+#endif
+
 /* Internal to kernel */
 void rcu_init(void);
 extern int rcu_scheduler_active;
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index f53ad63b2bc6..edd632e68497 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB
	  Say N here if you hate read-side memory barriers.
	  Take the default if you are unsure.
+config RCU_LAZY
+	bool "RCU callback lazy invocation functionality"
+	depends on RCU_NOCB_CPU
+	default n
+	help
+	  To save power, batch RCU callbacks and flush after delay, memory
+	  pressure or callback list growing too big.
+
 endmenu # "RCU Subsystem"
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index be5979da07f5..65704cbc9df7 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -474,6 +474,14 @@ enum rcutorture_type {
	INVALID_RCU_FLAVOR
 };
+#if defined(CONFIG_RCU_LAZY)
+unsigned long rcu_lazy_get_jiffies_till_flush(void);
+void rcu_lazy_set_jiffies_till_flush(unsigned long j);
+#else
+static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
+static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
+#endif
+
 #if defined(CONFIG_TREE_RCU)
 void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
			    unsigned long *gp_seq);
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index a33a8d4942c3..810479cf17ba 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -44,7 +44,7 @@ static struct rcu_ctrlblk rcu_ctrlblk = {
 void rcu_barrier(void)
 {
-	wait_rcu_gp(call_rcu);
+	wait_rcu_gp(call_rcu_flush);
 }
 EXPORT_SYMBOL(rcu_barrier);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5ec97e3f7468..736d0d724207 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp)
	raw_spin_unlock_rcu_node(rnp);
 }
-/**
- * call_rcu() - Queue an RCU callback for invocation after a grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual callback function to be invoked after the grace period
- *
- * The callback function will be invoked some time after a full grace
- * period elapses, in other words after all pre-existing RCU read-side
- * critical sections have completed. However, the callback function
- * might well execute concurrently with RCU read-side critical sections
- * that started after call_rcu() was invoked.
- *
- * RCU read-side critical sections are delimited by rcu_read_lock()
- * and rcu_read_unlock(), and may be nested. In addition, but only in
- * v5.0 and later, regions of code across which interrupts, preemption,
- * or softirqs have been disabled also serve as RCU read-side critical
- * sections. This includes hardware interrupt handlers, softirq handlers,
- * and NMI handlers.
- *
- * Note that all CPUs must agree that the grace period extended beyond
- * all pre-existing RCU read-side critical section. On systems with more
- * than one CPU, this means that when "func()" is invoked, each CPU is
- * guaranteed to have executed a full memory barrier since the end of its
- * last RCU read-side critical section whose beginning preceded the call
- * to call_rcu(). It also means that each CPU executing an RCU read-side
- * critical section that continues beyond the start of "func()" must have
- * executed a memory barrier after the call_rcu() but before the beginning
- * of that RCU read-side critical section. Note that these guarantees
- * include CPUs that are offline, idle, or executing in user mode, as
- * well as CPUs that are executing in the kernel.
- *
- * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
- * resulting RCU callback function "func()", then both CPU A and CPU B are
- * guaranteed to execute a full memory barrier during the time interval
- * between the call to call_rcu() and the invocation of "func()" -- even
- * if CPU A and CPU B are the same CPU (but again only if the system has
- * more than one CPU).
- *
- * Implementation of these memory-ordering guarantees is described here:
- * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
- */
-void call_rcu(struct rcu_head *head, rcu_callback_t func)
+static void
+__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
 {
	static atomic_t doublefrees;
	unsigned long flags;
@@ -2809,7 +2770,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
	}
	check_cb_ovld(rdp);
-	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
+	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
		return; // Enqueued onto ->nocb_bypass, so just leave.
	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
	rcu_segcblist_enqueue(&rdp->cblist, head);
@@ -2831,8 +2792,84 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
		local_irq_restore(flags);
	}
 }
-EXPORT_SYMBOL_GPL(call_rcu);
+#ifdef CONFIG_RCU_LAZY
+/**
+ * call_rcu_flush() - Queue RCU callback for invocation after grace period, and
+ * flush all lazy callbacks (including the new one) to the main ->cblist while
+ * doing so.
+ *
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all pre-existing RCU read-side
+ * critical sections have completed.
+ *
+ * Use this API instead of call_rcu() if you don't mind the callback being
+ * invoked after very long periods of time on systems without memory pressure
+ * and on systems which are lightly loaded or mostly idle.
+ *
+ * Other than the extra delay in callbacks being invoked, this function is
+ * identical to, and reuses call_rcu()'s logic. Refer to call_rcu() for more
+ * details about memory ordering and other functionality.
+ */
+void call_rcu_flush(struct rcu_head *head, rcu_callback_t func)
+{
+	return __call_rcu_common(head, func, false);
+}
+EXPORT_SYMBOL_GPL(call_rcu_flush);
+#endif
+
+/**
+ * call_rcu() - Queue an RCU callback for invocation after a grace period.
+ * By default the callbacks are 'lazy' and are kept hidden from the main
+ * ->cblist to prevent starting of grace periods too soon.
+ * If you desire grace periods to start very soon, use call_rcu_flush().
+ *
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all pre-existing RCU read-side
+ * critical sections have completed. However, the callback function
+ * might well execute concurrently with RCU read-side critical sections
+ * that started after call_rcu() was invoked.
+ *
+ * RCU read-side critical sections are delimited by rcu_read_lock()
+ * and rcu_read_unlock(), and may be nested. In addition, but only in
+ * v5.0 and later, regions of code across which interrupts, preemption,
+ * or softirqs have been disabled also serve as RCU read-side critical
+ * sections. This includes hardware interrupt handlers, softirq handlers,
+ * and NMI handlers.
+ *
+ * Note that all CPUs must agree that the grace period extended beyond
+ * all pre-existing RCU read-side critical section. On systems with more
+ * than one CPU, this means that when "func()" is invoked, each CPU is
+ * guaranteed to have executed a full memory barrier since the end of its
+ * last RCU read-side critical section whose beginning preceded the call
+ * to call_rcu(). It also means that each CPU executing an RCU read-side
+ * critical section that continues beyond the start of "func()" must have
+ * executed a memory barrier after the call_rcu() but before the beginning
+ * of that RCU read-side critical section. Note that these guarantees
+ * include CPUs that are offline, idle, or executing in user mode, as
+ * well as CPUs that are executing in the kernel.
+ *
+ * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
+ * resulting RCU callback function "func()", then both CPU A and CPU B are
+ * guaranteed to execute a full memory barrier during the time interval
+ * between the call to call_rcu() and the invocation of "func()" -- even
+ * if CPU A and CPU B are the same CPU (but again only if the system has
+ * more than one CPU).
+ *
+ * Implementation of these memory-ordering guarantees is described here:
+ * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
+ */
+void call_rcu(struct rcu_head *head, rcu_callback_t func)
+{
+	return __call_rcu_common(head, func, true);
+}
+EXPORT_SYMBOL_GPL(call_rcu);
 /* Maximum number of jiffies to wait before draining a batch. */
 #define KFREE_DRAIN_JIFFIES (5 * HZ)
@@ -3507,7 +3544,7 @@ void synchronize_rcu(void)
	if (rcu_gp_is_expedited())
		synchronize_rcu_expedited();
	else
-		wait_rcu_gp(call_rcu);
+		wait_rcu_gp(call_rcu_flush);
	return;
 }
@@ -3902,7 +3939,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
	rdp->barrier_head.func = rcu_barrier_callback;
	debug_rcu_head_queue(&rdp->barrier_head);
	rcu_nocb_lock(rdp);
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
+	/*
+	 * Flush the bypass list, but also wake up the GP thread as otherwise
+	 * bypass/lazy CBs maynot be noticed, and can cause real long delays!
+	 */
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE));
	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
		atomic_inc(&rcu_state.barrier_cpu_count);
	} else {
@@ -4323,7 +4364,7 @@ void rcutree_migrate_callbacks(int cpu)
	my_rdp = this_cpu_ptr(&rcu_data);
	my_rnp = my_rdp->mynode;
	rcu_nocb_lock(my_rdp); /* irqs already disabled. */
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies));
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, FLUSH_BP_NONE));
	raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */
	/* Leverage recent GPs and set GP for new callbacks. */
	needwake = rcu_advance_cbs(my_rnp, rdp) ||
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index d4a97e40ea9c..361c41d642c7 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -263,14 +263,16 @@ struct rcu_data {
	unsigned long last_fqs_resched;	/* Time of last rcu_resched(). */
	unsigned long last_sched_clock;	/* Jiffies of last rcu_sched_clock_irq(). */
+	long lazy_len;			/* Length of buffered lazy callbacks. */
	int cpu;
 };
 /* Values for nocb_defer_wakeup field in struct rcu_data. */
 #define RCU_NOCB_WAKE_NOT	0
 #define RCU_NOCB_WAKE_BYPASS	1
-#define RCU_NOCB_WAKE		2
-#define RCU_NOCB_WAKE_FORCE	3
+#define RCU_NOCB_WAKE_LAZY	2
+#define RCU_NOCB_WAKE		3
+#define RCU_NOCB_WAKE_FORCE	4
 #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
					/* For jiffies_till_first_fqs and */
@@ -439,10 +441,17 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp);
 static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
 static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
 static void rcu_init_one_nocb(struct rcu_node *rnp);
+
+#define FLUSH_BP_NONE 0
+/* Is the CB being enqueued after the flush, a lazy CB? */
+#define FLUSH_BP_LAZY BIT(0)
+/* Wake up nocb-GP thread after flush? */
+#define FLUSH_BP_WAKE BIT(1)
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j);
+				  unsigned long j, unsigned long flush_flags);
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags);
+				bool *was_alldone, unsigned long flags,
+				bool lazy);
 static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
				 unsigned long flags);
 static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 18e9b4cd78ef..5cac05600798 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -937,7 +937,7 @@ void synchronize_rcu_expedited(void)
	/* If expedited grace periods are prohibited, fall back to normal. */
	if (rcu_gp_is_normal()) {
-		wait_rcu_gp(call_rcu);
+		wait_rcu_gp(call_rcu_flush);
		return;
	}
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index f77a6d7e1356..661c685aba3f 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
 }
+/*
+ * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
+ * can elapse before lazy callbacks are flushed. Lazy callbacks
+ * could be flushed much earlier for a number of other reasons
+ * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are
+ * left unsubmitted to RCU after those many jiffies.
+ */
+#define LAZY_FLUSH_JIFFIES (10 * HZ)
+static unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
+
+#ifdef CONFIG_RCU_LAZY
+// To be called only from test code.
+void rcu_lazy_set_jiffies_till_flush(unsigned long jif)
+{
+	jiffies_till_flush = jif;
+}
+EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush);
+
+unsigned long rcu_lazy_get_jiffies_till_flush(void)
+{
+	return jiffies_till_flush;
+}
+EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush);
+#endif
+
 /*
  * Arrange to wake the GP kthread for this NOCB group at some future
  * time when it is safe to do so.
@@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
	/*
-	 * Bypass wakeup overrides previous deferments. In case
-	 * of callback storm, no need to wake up too early.
+	 * Bypass wakeup overrides previous deferments. In case of
+	 * callback storm, no need to wake up too early.
	 */
-	if (waketype == RCU_NOCB_WAKE_BYPASS) {
+	if (waketype == RCU_NOCB_WAKE_LAZY
+	    && READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) {
+		mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush);
+		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
+	} else if (waketype == RCU_NOCB_WAKE_BYPASS) {
		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
	} else {
@@ -293,12 +322,16 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
  * proves to be initially empty, just return false because the no-CB GP
  * kthread may need to be awakened in this case.
  *
+ * Return true if there was something to be flushed and it succeeded, otherwise
+ * false.
+ *
  * Note that this function always returns true if rhp is NULL.
  */
 static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				     unsigned long j)
+				     unsigned long j, unsigned long flush_flags)
 {
	struct rcu_cblist rcl;
+	bool lazy = flush_flags & FLUSH_BP_LAZY;
	WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp));
	rcu_lockdep_assert_cblist_protected(rdp);
@@ -310,7 +343,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
	/* Note: ->cblist.len already accounts for ->nocb_bypass contents. */
	if (rhp)
		rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
-	rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
+
+	/*
+	 * If the new CB requested was a lazy one, queue it onto the main
+	 * ->cblist so we can take advantage of a sooner grade period.
+	 */
+	if (lazy && rhp) {
+		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL);
+		rcu_cblist_enqueue(&rcl, rhp);
+		WRITE_ONCE(rdp->lazy_len, 0);
+	} else {
+		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
+		WRITE_ONCE(rdp->lazy_len, 0);
+	}
+
	rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
	WRITE_ONCE(rdp->nocb_bypass_first, j);
	rcu_nocb_bypass_unlock(rdp);
@@ -326,13 +372,33 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
  * Note that this function always returns true if rhp is NULL.
  */
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j)
+				  unsigned long j, unsigned long flush_flags)
 {
+	bool ret;
+	bool was_alldone = false;
+	bool bypass_all_lazy = false;
+	long bypass_ncbs;
+
	if (!rcu_rdp_is_offloaded(rdp))
		return true;
	rcu_lockdep_assert_cblist_protected(rdp);
	rcu_nocb_bypass_lock(rdp);
-	return rcu_nocb_do_flush_bypass(rdp, rhp, j);
+
+	if (flush_flags & FLUSH_BP_WAKE) {
+		was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
+		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+		bypass_all_lazy = bypass_ncbs && (bypass_ncbs == rdp->lazy_len);
+	}
+
+	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
+
+	// Wake up the nocb GP thread if needed. GP thread could be sleeping
+	// while waiting for lazy timer to expire (otherwise rcu_barrier may
+	// end up waiting for the duration of the lazy timer).
+	if (flush_flags & FLUSH_BP_WAKE && was_alldone && bypass_all_lazy)
+		wake_nocb_gp(rdp, false);
+
+	return ret;
 }
 /*
@@ -345,7 +411,7 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
	if (!rcu_rdp_is_offloaded(rdp) ||
	    !rcu_nocb_bypass_trylock(rdp))
		return;
-	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j));
+	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j, FLUSH_BP_NONE));
 }
 /*
@@ -367,12 +433,14 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
  * there is only one CPU in operation.
  */
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags)
+				bool *was_alldone, unsigned long flags,
+				bool lazy)
 {
	unsigned long c;
	unsigned long cur_gp_seq;
	unsigned long j = jiffies;
	long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+	bool bypass_is_lazy = (ncbs == READ_ONCE(rdp->lazy_len));
	lockdep_assert_irqs_disabled();
@@ -417,25 +485,30 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
	// If there hasn't yet been all that many ->cblist enqueues
	// this jiffy, tell the caller to enqueue onto ->cblist.  But flush
	// ->nocb_bypass first.
-	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) {
+	// Lazy CBs throttle this back and do immediate bypass queuing.
+	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) {
		rcu_nocb_lock(rdp);
		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
		if (*was_alldone)
			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
					    TPS("FirstQ"));
-		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j));
+
+		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, FLUSH_BP_NONE));
		WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
		return false; // Caller must enqueue the callback.
	}
	// If ->nocb_bypass has been used too long or is too full,
	// flush ->nocb_bypass to ->cblist.
-	if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) ||
+	if ((ncbs && !bypass_is_lazy && j != READ_ONCE(rdp->nocb_bypass_first)) ||
+	    (ncbs && bypass_is_lazy &&
+	     (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush))) ||
	    ncbs >= qhimark) {
		rcu_nocb_lock(rdp);
		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
-		if (!rcu_nocb_flush_bypass(rdp, rhp, j)) {
+
+		if (!rcu_nocb_flush_bypass(rdp, rhp, j,
+					   lazy ? FLUSH_BP_LAZY : FLUSH_BP_NONE)) {
			if (*was_alldone)
				trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
						    TPS("FirstQ"));
@@ -460,16 +533,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
	// We need to use the bypass.
	rcu_nocb_wait_contended(rdp);
	rcu_nocb_bypass_lock(rdp);
+
	ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
	rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
+
+	if (IS_ENABLED(CONFIG_RCU_LAZY) && lazy)
+		WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1);
+
	if (!ncbs) {
		WRITE_ONCE(rdp->nocb_bypass_first, j);
		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
	}
+
	rcu_nocb_bypass_unlock(rdp);
	smp_mb(); /* Order enqueue before wake. */
-	if (ncbs) {
+
+	// A wake up of the grace period kthread or timer adjustment needs to
+	// be done only if:
+	// 1. Bypass list was fully empty before (this is the first bypass list entry).
+	//	Or, both the below conditions are met:
+	// 1. Bypass list had only lazy CBs before.
+	// 2. The new CB is non-lazy.
+	if (ncbs && (!bypass_is_lazy || lazy)) {
		local_irq_restore(flags);
	} else {
		// No-CBs GP kthread might be indefinitely asleep, if so, wake.
@@ -499,7 +585,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
 {
	unsigned long cur_gp_seq;
	unsigned long j;
-	long len;
+	long len, lazy_len, bypass_len;
	struct task_struct *t;
	// If we are being polled or there is no kthread, just leave.
@@ -512,9 +598,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
	}
	// Need to actually to a wakeup.
	len = rcu_segcblist_n_cbs(&rdp->cblist);
+	bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+	lazy_len = READ_ONCE(rdp->lazy_len);
	if (was_alldone) {
		rdp->qlen_last_fqs_check = len;
-		if (!irqs_disabled_flags(flags)) {
+		// Only lazy CBs in bypass list
+		if (lazy_len && bypass_len == lazy_len) {
+			rcu_nocb_unlock_irqrestore(rdp, flags);
+			wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY,
+					   TPS("WakeLazy"));
+		} else if (!irqs_disabled_flags(flags)) {
			/* ... if queue was empty ... */
			rcu_nocb_unlock_irqrestore(rdp, flags);
			wake_nocb_gp(rdp, false);
@@ -604,8 +697,8 @@ static void nocb_gp_sleep(struct rcu_data *my_rdp, int cpu)
  */
 static void nocb_gp_wait(struct rcu_data *my_rdp)
 {
-	bool bypass = false;
-	long bypass_ncbs;
+	bool bypass = false, lazy = false;
+	long bypass_ncbs, lazy_ncbs;
	int __maybe_unused cpu = my_rdp->cpu;
	unsigned long cur_gp_seq;
	unsigned long flags;
@@ -640,24 +733,41 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
	 * won't be ignored for long.
	 */
	list_for_each_entry(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp) {
+		bool flush_bypass = false;
+
		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
		rcu_nocb_lock_irqsave(rdp, flags);
		lockdep_assert_held(&rdp->nocb_lock);
		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
-		if (bypass_ncbs &&
+		lazy_ncbs = READ_ONCE(rdp->lazy_len);
+
+		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
+		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
+		     bypass_ncbs > 2 * qhimark)) {
+			flush_bypass = true;
+		} else if (bypass_ncbs && (lazy_ncbs != bypass_ncbs) &&
		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) ||
		     bypass_ncbs > 2 * qhimark)) {
-			// Bypass full or old, so flush it.
-			(void)rcu_nocb_try_flush_bypass(rdp, j);
-			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+			flush_bypass = true;
		} else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
			rcu_nocb_unlock_irqrestore(rdp, flags);
			continue; /* No callbacks here, try next. */
		}
+
+		if (flush_bypass) {
+			// Bypass full or old, so flush it.
+			(void)rcu_nocb_try_flush_bypass(rdp, j);
+			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+			lazy_ncbs = READ_ONCE(rdp->lazy_len);
+		}
+
		if (bypass_ncbs) {
			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
-					    TPS("Bypass"));
-			bypass = true;
+					    bypass_ncbs == lazy_ncbs ? TPS("Lazy") : TPS("Bypass"));
+			if (bypass_ncbs == lazy_ncbs)
+				lazy = true;
+			else
+				bypass = true;
		}
		rnp = rdp->mynode;
@@ -705,12 +815,21 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
	my_rdp->nocb_gp_gp = needwait_gp;
	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
-	if (bypass && !rcu_nocb_poll) {
-		// At least one child with non-empty ->nocb_bypass, so set
-		// timer in order to avoid stranding its callbacks.
-		wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
-				   TPS("WakeBypassIsDeferred"));
+	// At least one child with non-empty ->nocb_bypass, so set
+	// timer in order to avoid stranding its callbacks.
+	if (!rcu_nocb_poll) {
+		// If bypass list only has lazy CBs. Add a deferred
+		// lazy wake up.
+		if (lazy && !bypass) {
+			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
+					   TPS("WakeLazyIsDeferred"));
+		// Otherwise add a deferred bypass wake up.
+		} else if (bypass) {
+			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
+					   TPS("WakeBypassIsDeferred"));
+		}
	}
+
	if (rcu_nocb_poll) {
		/* Polling, so trace if first poll in the series. */
		if (gotcbs)
@@ -1036,7 +1155,7 @@ static long rcu_nocb_rdp_deoffload(void *arg)
	 * return false, which means that future calls to rcu_nocb_try_bypass()
	 * will refuse to put anything into the bypass.
	 */
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_NONE));
	/*
	 * Start with invoking rcu_core() early. This way if the current thread
	 * happens to preempt an ongoing call to rcu_core() in the middle,
@@ -1278,6 +1397,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
	raw_spin_lock_init(&rdp->nocb_gp_lock);
	timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0);
	rcu_cblist_init(&rdp->nocb_bypass);
+	WRITE_ONCE(rdp->lazy_len, 0);
	mutex_init(&rdp->nocb_gp_kthread_mutex);
 }
@@ -1559,13 +1679,13 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 }
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j)
+				  unsigned long j, unsigned long flush_flags)
 {
	return true;
 }
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags)
+				bool *was_alldone, unsigned long flags, bool lazy)
 {
	return false;
 }
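A side note on the jiffies_till_flush helpers added to tree_nocb.h above: they are meant to be called only from test code. The sketch below assumes in-tree code living under kernel/rcu/ (for example an rcuscale-style test), since kernel/rcu/rcu.h is a private header; the wrapper functions themselves are hypothetical.

#include "rcu.h"	/* kernel/rcu/rcu.h, in-tree only */

static unsigned long orig_flush_jiffies;

/* Shorten the lazy flush interval from the 10 second default to about 1s. */
static void lazy_test_speed_up_flushes(void)
{
	orig_flush_jiffies = rcu_lazy_get_jiffies_till_flush();
	rcu_lazy_set_jiffies_till_flush(HZ);
}

/* Restore the previous interval once the test is done. */
static void lazy_test_restore_flushes(void)
{
	rcu_lazy_set_jiffies_till_flush(orig_flush_jiffies);
}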
Implement timer-based RCU lazy callback batching. The batch is flushed
whenever a certain amount of time has passed, or the batch on a
particular CPU grows too big. Also memory pressure will flush it in a
future patch.

To handle several corner cases automagically (such as rcu_barrier() and
hotplug), we re-use bypass lists to handle lazy CBs. The bypass list
length has the lazy CB length included in it. A separate lazy CB length
counter is also introduced to keep track of the number of lazy CBs.

v5->v6:

[ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other
  deferral levels wake much earlier so for those it is not needed. ]

[ Frederic Weisbec: Use flush flags to keep bypass API code clean. ]

[ Frederic Weisbec: Make rcu_barrier() wake up only if main list empty. ]

[ Frederic Weisbec: Remove extra 'else if' branch in rcu_nocb_try_bypass(). ]

[ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ]

[ Paul/Thomas/Joel: Make call_rcu() default lazy so users don't mess up. ]

Suggested-by: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/rcupdate.h |   7 ++
 kernel/rcu/Kconfig       |   8 ++
 kernel/rcu/rcu.h         |   8 ++
 kernel/rcu/tiny.c        |   2 +-
 kernel/rcu/tree.c        | 133 ++++++++++++++++++----------
 kernel/rcu/tree.h        |  17 +++-
 kernel/rcu/tree_exp.h    |   2 +-
 kernel/rcu/tree_nocb.h   | 184 ++++++++++++++++++++++++++++++++-------
 8 files changed, 277 insertions(+), 84 deletions(-)
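To make the intended default concrete, here is a minimal sketch of the common caller that simply stays lazy (the names are hypothetical, not from the patch): nothing waits on the callback, so it is fine for it to be batched until the flush timer fires or the per-CPU batch grows too big (and, per the note above, memory pressure in a future patch).

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	struct rcu_head rcu;
	int data;
};

static void foo_free_rcu(struct rcu_head *rhp)
{
	kfree(container_of(rhp, struct foo, rcu));
}

static void foo_release(struct foo *fp)
{
	/*
	 * No task waits on this free, so the lazy default is fine and
	 * saves wakeups on an otherwise idle system.
	 */
	call_rcu(&fp->rcu, foo_free_rcu);
}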