
[v3,3/4] rcu/nocb: Protect NOCB state via local_lock() under PREEMPT_RT

Message ID 20210811201354.1976839-4-valentin.schneider@arm.com (mailing list archive)
State New, archived
Series rcu, arm64: PREEMPT_RT fixlets

Commit Message

Valentin Schneider Aug. 11, 2021, 8:13 p.m. UTC
Warning
=======

Running v5.13-rt1 on my arm64 Juno board triggers:

[    0.156302] =============================
[    0.160416] WARNING: suspicious RCU usage
[    0.164529] 5.13.0-rt1 #20 Not tainted
[    0.168300] -----------------------------
[    0.172409] kernel/rcu/tree_plugin.h:69 Unsafe read of RCU_NOCB offloaded state!
[    0.179920]
[    0.179920] other info that might help us debug this:
[    0.179920]
[    0.188037]
[    0.188037] rcu_scheduler_active = 1, debug_locks = 1
[    0.194677] 3 locks held by rcuc/0/11:
[    0.198448] #0: ffff00097ef10cf8 ((softirq_ctrl.lock).lock){+.+.}-{2:2}, at: __local_bh_disable_ip (./include/linux/rcupdate.h:662 kernel/softirq.c:171)
[    0.208709] #1: ffff80001205e5f0 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock (kernel/locking/spinlock_rt.c:43 (discriminator 4))
[    0.217134] #2: ffff80001205e5f0 (rcu_read_lock){....}-{1:2}, at: __local_bh_disable_ip (kernel/softirq.c:169)
[    0.226428]
[    0.226428] stack backtrace:
[    0.230889] CPU: 0 PID: 11 Comm: rcuc/0 Not tainted 5.13.0-rt1 #20
[    0.237100] Hardware name: ARM Juno development board (r0) (DT)
[    0.243041] Call trace:
[    0.245497] dump_backtrace (arch/arm64/kernel/stacktrace.c:163)
[    0.249185] show_stack (arch/arm64/kernel/stacktrace.c:219)
[    0.252522] dump_stack (lib/dump_stack.c:122)
[    0.255947] lockdep_rcu_suspicious (kernel/locking/lockdep.c:6439)
[    0.260328] rcu_rdp_is_offloaded (kernel/rcu/tree_plugin.h:69 kernel/rcu/tree_plugin.h:58)
[    0.264537] rcu_core (kernel/rcu/tree.c:2332 kernel/rcu/tree.c:2398 kernel/rcu/tree.c:2777)
[    0.267786] rcu_cpu_kthread (./include/linux/bottom_half.h:32 kernel/rcu/tree.c:2876)
[    0.271644] smpboot_thread_fn (kernel/smpboot.c:165 (discriminator 3))
[    0.275767] kthread (kernel/kthread.c:321)
[    0.279013] ret_from_fork (arch/arm64/kernel/entry.S:1005)

In this case, this is the RCU core kthread accessing the local CPU's
rdp. Before that, rcu_cpu_kthread() invokes local_bh_disable().

Under !CONFIG_PREEMPT_RT (and rcutree.use_softirq=0), this ends up
incrementing the preempt_count, which satisfies the "local non-preemptible
read" of rcu_rdp_is_offloaded().

Under CONFIG_PREEMPT_RT however, this becomes

  local_lock(&softirq_ctrl.lock)

which, under the same config, is migrate_disable() + rt_spin_lock(). As
pointed out by Frederic, this is not sufficient to safely access an rdp's
offload state, as the RCU core kthread can be preempted by a kworker
executing rcu_nocb_rdp_offload() [1].
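For reference, here is a simplified sketch (paraphrased, not the verbatim
kernel/softirq.c implementation) of what local_bh_disable() amounts to in
each configuration:

static inline void local_bh_disable_sketch(void)
{
#ifndef CONFIG_PREEMPT_RT
	/* Non-preemptible section: the "local non-preemptible read" holds. */
	preempt_count_add(SOFTIRQ_DISABLE_OFFSET);
	barrier();
#else
	/*
	 * migrate_disable() + rt_spin_lock() under the hood: preemption stays
	 * enabled, so a kworker running rcu_nocb_rdp_offload() can still
	 * preempt the rcuc kthread in the middle of rcu_core().
	 */
	local_lock(&softirq_ctrl.lock);
#endif
}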

Introduce a local_lock to serialize an rdp's offload state while the rdp's
associated core kthread is executing rcu_core().

rcu_core() preemptability considerations
========================================

As pointed out by Paul [2], keeping rcu_check_quiescent_state() preemptible
(which is the case under CONFIG_PREEMPT_RT) requires some consideration.

note_gp_changes() itself runs with irqs off, and enters
__note_gp_changes() with rnp->lock held (raw_spinlock), thus is safe vs
preemption.

rdp->core_needs_qs *could* change after being read by the RCU core
kthread if it then gets preempted. Consider, with
CONFIG_RCU_STRICT_GRACE_PERIOD:

  rcuc/x                                   task_foo

    rcu_check_quiescent_state()
    `\
      rdp->core_needs_qs == true
			     <PREEMPT>
					   rcu_read_unlock()
					   `\
					     rcu_preempt_deferred_qs_irqrestore()
					     `\
					       rcu_report_qs_rdp()
					       `\
						 rdp->core_needs_qs := false;

This would let rcuc/x's rcu_check_quiescent_state() proceed further down to
rcu_report_qs_rdp(), but task_foo's earlier rcu_report_qs_rdp() invocation
would have already cleared the rdp grpmask from the rnp mask, so rcuc/x's
invocation would simply bail.

Since rcu_report_qs_rdp() can be safely invoked, even if rdp->core_needs_qs
changed, it appears safe to keep rcu_check_quiescent_state() preemptible.
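For reference, a paraphrased (not verbatim) excerpt of the relevant checks in
rcu_report_qs_rdp() showing why the late invocation is harmless:

	raw_spin_lock_irqsave_rcu_node(rnp, flags);
	if (rdp->cpu_no_qs.b.norm || rdp->gp_seq != rnp->gp_seq || rdp->gpwrap) {
		/* Nothing to report for this GP any more (state resets elided). */
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
		return;
	}
	mask = rdp->grpmask;
	if (!(rnp->qsmask & mask)) {
		/* task_foo's earlier report already cleared our bit: bail. */
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
		return;
	}
	/* ...otherwise report the quiescent state up the rcu_node tree. */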

[1]: http://lore.kernel.org/r/20210727230814.GC283787@lothringen
[2]: http://lore.kernel.org/r/20210729010445.GO4397@paulmck-ThinkPad-P17-Gen-1
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
---
 kernel/rcu/tree.c        |  4 ++
 kernel/rcu/tree.h        |  4 ++
 kernel/rcu/tree_plugin.h | 82 +++++++++++++++++++++++++++++++++++-----
 3 files changed, 81 insertions(+), 9 deletions(-)

Comments

Paul E. McKenney Aug. 13, 2021, 12:20 a.m. UTC | #1
On Wed, Aug 11, 2021 at 09:13:53PM +0100, Valentin Schneider wrote:
> Warning
> =======
> 
> Running v5.13-rt1 on my arm64 Juno board triggers:
> 
> [    0.156302] =============================
> [    0.160416] WARNING: suspicious RCU usage
> [    0.164529] 5.13.0-rt1 #20 Not tainted
> [    0.168300] -----------------------------
> [    0.172409] kernel/rcu/tree_plugin.h:69 Unsafe read of RCU_NOCB offloaded state!
> [    0.179920]
> [    0.179920] other info that might help us debug this:
> [    0.179920]
> [    0.188037]
> [    0.188037] rcu_scheduler_active = 1, debug_locks = 1
> [    0.194677] 3 locks held by rcuc/0/11:
> [    0.198448] #0: ffff00097ef10cf8 ((softirq_ctrl.lock).lock){+.+.}-{2:2}, at: __local_bh_disable_ip (./include/linux/rcupdate.h:662 kernel/softirq.c:171)
> [    0.208709] #1: ffff80001205e5f0 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock (kernel/locking/spinlock_rt.c:43 (discriminator 4))
> [    0.217134] #2: ffff80001205e5f0 (rcu_read_lock){....}-{1:2}, at: __local_bh_disable_ip (kernel/softirq.c:169)
> [    0.226428]
> [    0.226428] stack backtrace:
> [    0.230889] CPU: 0 PID: 11 Comm: rcuc/0 Not tainted 5.13.0-rt1 #20
> [    0.237100] Hardware name: ARM Juno development board (r0) (DT)
> [    0.243041] Call trace:
> [    0.245497] dump_backtrace (arch/arm64/kernel/stacktrace.c:163)
> [    0.249185] show_stack (arch/arm64/kernel/stacktrace.c:219)
> [    0.252522] dump_stack (lib/dump_stack.c:122)
> [    0.255947] lockdep_rcu_suspicious (kernel/locking/lockdep.c:6439)
> [    0.260328] rcu_rdp_is_offloaded (kernel/rcu/tree_plugin.h:69 kernel/rcu/tree_plugin.h:58)
> [    0.264537] rcu_core (kernel/rcu/tree.c:2332 kernel/rcu/tree.c:2398 kernel/rcu/tree.c:2777)
> [    0.267786] rcu_cpu_kthread (./include/linux/bottom_half.h:32 kernel/rcu/tree.c:2876)
> [    0.271644] smpboot_thread_fn (kernel/smpboot.c:165 (discriminator 3))
> [    0.275767] kthread (kernel/kthread.c:321)
> [    0.279013] ret_from_fork (arch/arm64/kernel/entry.S:1005)
> 
> In this case, this is the RCU core kthread accessing the local CPU's
> rdp. Before that, rcu_cpu_kthread() invokes local_bh_disable().
> 
> Under !CONFIG_PREEMPT_RT (and rcutree.use_softirq=0), this ends up
> incrementing the preempt_count, which satisfies the "local non-preemptible
> read" of rcu_rdp_is_offloaded().
> 
> Under CONFIG_PREEMPT_RT however, this becomes
> 
>   local_lock(&softirq_ctrl.lock)
> 
> which, under the same config, is migrate_disable() + rt_spin_lock(). As
> pointed out by Frederic, this is not sufficient to safely access an rdp's
> offload state, as the RCU core kthread can be preempted by a kworker
> executing rcu_nocb_rdp_offload() [1].
> 
> Introduce a local_lock to serialize an rdp's offload state while the rdp's
> associated core kthread is executing rcu_core().
> 
> rcu_core() preemptability considerations
> ========================================
> 
> As pointed out by Paul [2], keeping rcu_check_quiescent_state() preemptible
> (which is the case under CONFIG_PREEMPT_RT) requires some consideration.
> 
> note_gp_changes() itself runs with irqs off, and enters
> __note_gp_changes() with rnp->lock held (raw_spinlock), thus is safe vs
> preemption.
> 
> rdp->core_needs_qs *could* change after being read by the RCU core
> kthread if it then gets preempted. Consider, with
> CONFIG_RCU_STRICT_GRACE_PERIOD:
> 
>   rcuc/x                                   task_foo
> 
>     rcu_check_quiescent_state()
>     `\
>       rdp->core_needs_qs == true
> 			     <PREEMPT>
> 					   rcu_read_unlock()
> 					   `\
> 					     rcu_preempt_deferred_qs_irqrestore()
> 					     `\
> 					       rcu_report_qs_rdp()
> 					       `\
> 						 rdp->core_needs_qs := false;
> 
> This would let rcuc/x's rcu_check_quiescent_state() proceed further down to
> rcu_report_qs_rdp(), but task_foo's earlier rcu_report_qs_rdp() invocation
> would have already cleared the rdp grpmask from the rnp mask, so rcuc/x's
> invocation would simply bail.
> 
> Since rcu_report_qs_rdp() can be safely invoked, even if rdp->core_needs_qs
> changed, it appears safe to keep rcu_check_quiescent_state() preemptible.

Another concern...

During the preemption of rcu_check_quiescent_state() someone might report
a quiescent state on behalf of CPU x (perhaps due to its having recently
been idle) and then the RCU grace-period kthread might start running on
CPU x, where it might initialize a new grace period in rcu_gp_init().
It can then invoke __note_gp_changes(), also on CPU x.

If preempted as shown above just after checking ->core_needs_qs, the
->cpu_no_qs.b.norm field will be set by the grace-period kthread, which
will cause the rcu_check_quiescent_state() function's subsequent check
of ->cpu_no_qs.b.norm to take an early exit.  So OK here.

On the other hand, if preempted just after the rcu_check_quiescent_state()
function's check of ->cpu_no_qs.b.norm, the later invocation of
rcu_report_qs_rdp() should take an early exit due to ->gp_seq mismatch.
So OK here.

However, this should be added to the commit log.  Might be a big commit
log, but mass storage is cheap these days.  ;-)

This needs a review of each and every manipulation of ->core_needs_qs
and ->cpu_no_qs.b.norm.  For example, the preemptions will cause the
scheduler to invoke RCU's context-switch hooks, which also mess with
->cpu_no_qs.b.norm.  I can get to that some time next week (or tomorrow,
if things go better than expected), but it would be good for you (and
others) to check as well.
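
As one concrete example, here is a paraphrased (not verbatim) sketch of the
rcu_qs() helper on the rcu_note_context_switch() path, which every
preemption-driven context switch can run:

static void rcu_qs(void)
{
	if (__this_cpu_read(rcu_data.cpu_no_qs.s)) {
		/* Clears the flag behind a preempted rcuc/x that read it. */
		__this_cpu_write(rcu_data.cpu_no_qs.b.norm, false);
		barrier();
		WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, false);
	}
}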

Frederic should look this over, but I am taking a quick pass in the
meantime.  Please see below.

							Thanx, Paul

> [1]: http://lore.kernel.org/r/20210727230814.GC283787@lothringen
> [2]: http://lore.kernel.org/r/20210729010445.GO4397@paulmck-ThinkPad-P17-Gen-1
> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
> ---
>  kernel/rcu/tree.c        |  4 ++
>  kernel/rcu/tree.h        |  4 ++
>  kernel/rcu/tree_plugin.h | 82 +++++++++++++++++++++++++++++++++++-----
>  3 files changed, 81 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 51f24ecd94b2..caadfec994f3 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -87,6 +87,7 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
>  	.dynticks = ATOMIC_INIT(RCU_DYNTICK_CTRL_CTR),
>  #ifdef CONFIG_RCU_NOCB_CPU
>  	.cblist.flags = SEGCBLIST_SOFTIRQ_ONLY,
> +	.nocb_local_lock =  INIT_LOCAL_LOCK(nocb_local_lock),
>  #endif
>  };
>  static struct rcu_state rcu_state = {
> @@ -2853,10 +2854,12 @@ static void rcu_cpu_kthread(unsigned int cpu)
>  {
>  	unsigned int *statusp = this_cpu_ptr(&rcu_data.rcu_cpu_kthread_status);
>  	char work, *workp = this_cpu_ptr(&rcu_data.rcu_cpu_has_work);
> +	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>  	int spincnt;
>  
>  	trace_rcu_utilization(TPS("Start CPU kthread@rcu_run"));
>  	for (spincnt = 0; spincnt < 10; spincnt++) {
> +		rcu_nocb_local_lock(rdp);
>  		local_bh_disable();
>  		*statusp = RCU_KTHREAD_RUNNING;
>  		local_irq_disable();
> @@ -2866,6 +2869,7 @@ static void rcu_cpu_kthread(unsigned int cpu)
>  		if (work)
>  			rcu_core();
>  		local_bh_enable();
> +		rcu_nocb_local_unlock(rdp);
>  		if (*workp == 0) {
>  			trace_rcu_utilization(TPS("End CPU kthread@rcu_wait"));
>  			*statusp = RCU_KTHREAD_WAITING;
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 305cf6aeb408..aa6831255fec 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -210,6 +210,8 @@ struct rcu_data {
>  	struct timer_list nocb_timer;	/* Enforce finite deferral. */
>  	unsigned long nocb_gp_adv_time;	/* Last call_rcu() CB adv (jiffies). */
>  
> +	local_lock_t nocb_local_lock;

Should this go near the beginning of the structure, given that code
paths taking this lock tend to access ->cpu_no_qs, ->core_needs_qs,
and so on?

Given that it is used to protect core processing (not just offloaded
callbacks), might ->core_local_lock be a better name?

Please keep in mind that you can build kernels that offload callbacks
but that still use softirq for RCU core processing.  And vice versa,
that is, kernels that use rcuc kthreads but do not offload callbacks.

> +
>  	/* The following fields are used by call_rcu, hence own cacheline. */
>  	raw_spinlock_t nocb_bypass_lock ____cacheline_internodealigned_in_smp;
>  	struct rcu_cblist nocb_bypass;	/* Lock-contention-bypass CB list. */
> @@ -445,6 +447,8 @@ static void rcu_nocb_unlock(struct rcu_data *rdp);
>  static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
>  				       unsigned long flags);
>  static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp);
> +static void rcu_nocb_local_lock(struct rcu_data *rdp);
> +static void rcu_nocb_local_unlock(struct rcu_data *rdp);
>  #ifdef CONFIG_RCU_NOCB_CPU
>  static void __init rcu_organize_nocb_kthreads(void);
>  #define rcu_nocb_lock_irqsave(rdp, flags)				\
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 0ff5e4fb933e..11c4ff00afde 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -21,6 +21,17 @@ static inline int rcu_lockdep_is_held_nocb(struct rcu_data *rdp)
>  	return lockdep_is_held(&rdp->nocb_lock);
>  }
>  
> +static inline int rcu_lockdep_is_held_nocb_local(struct rcu_data *rdp)
> +{
> +	return lockdep_is_held(
> +#ifdef CONFIG_PREEMPT_RT
> +		&rdp->nocb_local_lock.lock
> +#else
> +		&rdp->nocb_local_lock
> +#endif

It would be good if this was abstracted.  Or is this the only place
in the kernel that needs this #ifdef?  Maybe a lockdep_is_held_rt()
or lockdep_is_held_local(), as the case may be?
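
A minimal sketch of such an abstraction (the helper name is made up here,
it is not an existing kernel API):

#ifdef CONFIG_PREEMPT_RT
# define local_lock_is_held(llock)	lockdep_is_held(&(llock).lock)
#else
# define local_lock_is_held(llock)	lockdep_is_held(&(llock))
#endif

static inline int rcu_lockdep_is_held_nocb_local(struct rcu_data *rdp)
{
	return local_lock_is_held(rdp->nocb_local_lock);
}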

> +	);
> +}
> +
>  static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp)
>  {
>  	/* Race on early boot between thread creation and assignment */
> @@ -38,7 +49,10 @@ static inline int rcu_lockdep_is_held_nocb(struct rcu_data *rdp)
>  {
>  	return 0;
>  }
> -
> +static inline int rcu_lockdep_is_held_nocb_local(struct rcu_data *rdp)
> +{
> +	return 0;

This is backwards of normal lockdep practice, which defaults to locks
always held.  And that will be what happens once lockdep has detected
its first deadlock, correct?  At which point, this function and its
earlier instance will be in conflict.

Or is there some subtle reason why this conflict would be OK?

> +}
>  static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp)
>  {
>  	return false;
> @@ -46,23 +60,44 @@ static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp)
>  
>  #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
>  
> +/*
> + * Is a local read of the rdp's offloaded state safe and stable?
> + * See rcu_nocb_local_lock() & family.
> + */
> +static inline bool rcu_local_offload_access_safe(struct rcu_data *rdp)
> +{
> +	if (!preemptible())
> +		return true;
> +
> +	if (!migratable()) {
> +		if (!IS_ENABLED(CONFIG_RCU_NOCB))

Do we also need to consult the use_softirq module parameter that controls
whether or not there are rcuc kthreads?

Actually, if !IS_ENABLED(CONFIG_RCU_NOCB) then rcu_rdp_is_offloaded()
can simply return false without bothering with the RCU_LOCKDEP_WARN().
Might be worth getting that out of the way of the RCU_LOCKDEP_WARN()
condition.  ;-)
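
Something along these lines, i.e. an early exit before the lockdep machinery
(illustrative sketch only; upstream the Kconfig symbol is spelled
CONFIG_RCU_NOCB_CPU):

static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
{
	/* Without NOCB support nothing can be offloaded. */
	if (!IS_ENABLED(CONFIG_RCU_NOCB_CPU))
		return false;

	/* The existing RCU_LOCKDEP_WARN() block would follow here unchanged. */
	return rcu_segcblist_is_offloaded(&rdp->cblist);
}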

> +			return true;
> +
> +		return rcu_lockdep_is_held_nocb_local(rdp);
> +	}
> +
> +	return false;
> +}
> +
>  static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
>  {
>  	/*
> -	 * In order to read the offloaded state of an rdp is a safe
> -	 * and stable way and prevent from its value to be changed
> -	 * under us, we must either hold the barrier mutex, the cpu
> -	 * hotplug lock (read or write) or the nocb lock. Local
> -	 * non-preemptible reads are also safe. NOCB kthreads and
> -	 * timers have their own means of synchronization against the
> -	 * offloaded state updaters.
> +	 * In order to read the offloaded state of an rdp is a safe and stable
> +	 * way and prevent from its value to be changed under us, we must either...
>  	 */
>  	RCU_LOCKDEP_WARN(
> +		// ...hold the barrier mutex...
>  		!(lockdep_is_held(&rcu_state.barrier_mutex) ||
> +		  // ... the cpu hotplug lock (read or write)...
>  		  (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_held()) ||
> +		  // ... or the NOCB lock.
>  		  rcu_lockdep_is_held_nocb(rdp) ||
> +		  // Local reads still require the local state to remain stable
> +		  // (preemption disabled / local lock held)
>  		  (rdp == this_cpu_ptr(&rcu_data) &&
> -		   !(IS_ENABLED(CONFIG_PREEMPT_COUNT) && preemptible())) ||
> +		   rcu_local_offload_access_safe(rdp)) ||
> +		  // NOCB kthreads and timers have their own means of synchronization
> +		  // against the offloaded state updaters.
>  		  rcu_current_is_nocb_kthread(rdp)),
>  		"Unsafe read of RCU_NOCB offloaded state"
>  	);
> @@ -1629,6 +1664,22 @@ static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
>  	}
>  }
>  
> +/*
> + * The invocation of rcu_core() within the RCU core kthreads remains preemptible
> + * under PREEMPT_RT, thus the offload state of a CPU could change while
> + * said kthreads are preempted. Prevent this from happening by protecting the
> + * offload state with a local_lock().
> + */
> +static void rcu_nocb_local_lock(struct rcu_data *rdp)
> +{
> +	local_lock(&rcu_data.nocb_local_lock);
> +}
> +
> +static void rcu_nocb_local_unlock(struct rcu_data *rdp)
> +{
> +	local_unlock(&rcu_data.nocb_local_lock);
> +}
> +
>  /* Lockdep check that ->cblist may be safely accessed. */
>  static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp)
>  {
> @@ -2396,6 +2447,7 @@ static int rdp_offload_toggle(struct rcu_data *rdp,
>  	if (rdp->nocb_cb_sleep)
>  		rdp->nocb_cb_sleep = false;
>  	rcu_nocb_unlock_irqrestore(rdp, flags);
> +	rcu_nocb_local_unlock(rdp);
>  
>  	/*
>  	 * Ignore former value of nocb_cb_sleep and force wake up as it could
> @@ -2427,6 +2479,7 @@ static long rcu_nocb_rdp_deoffload(void *arg)
>  
>  	pr_info("De-offloading %d\n", rdp->cpu);
>  
> +	rcu_nocb_local_lock(rdp);
>  	rcu_nocb_lock_irqsave(rdp, flags);
>  	/*
>  	 * Flush once and for all now. This suffices because we are
> @@ -2509,6 +2562,7 @@ static long rcu_nocb_rdp_offload(void *arg)
>  	 * Can't use rcu_nocb_lock_irqsave() while we are in
>  	 * SEGCBLIST_SOFTIRQ_ONLY mode.
>  	 */
> +	rcu_nocb_local_lock(rdp);
>  	raw_spin_lock_irqsave(&rdp->nocb_lock, flags);

These look plausible at first glance, but it would be good for Frederic
to look at the exact placement of these rcu_nocb_local_lock() and
rcu_nocb_local_unlock() calls.

>  	/*
> @@ -2868,6 +2922,16 @@ static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
>  	local_irq_restore(flags);
>  }
>  
> +/* No ->nocb_local_lock to acquire. */
> +static void rcu_nocb_local_lock(struct rcu_data *rdp)
> +{
> +}
> +
> +/* No ->nocb_local_lock to release. */
> +static void rcu_nocb_local_unlock(struct rcu_data *rdp)
> +{
> +}
> +
>  /* Lockdep check that ->cblist may be safely accessed. */
>  static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp)
>  {
> -- 
> 2.25.1
>
Valentin Schneider Aug. 13, 2021, 6:48 p.m. UTC | #2
On 12/08/21 17:20, Paul E. McKenney wrote:
> On Wed, Aug 11, 2021 at 09:13:53PM +0100, Valentin Schneider wrote:
>> rcu_core() preemptability considerations
>> ========================================
>> 
>> As pointed out by Paul [2], keeping rcu_check_quiescent_state() preemptible
>> (which is the case under CONFIG_PREEMPT_RT) requires some consideration.
>> 
>> note_gp_changes() itself runs with irqs off, and enters
>> __note_gp_changes() with rnp->lock held (raw_spinlock), thus is safe vs
>> preemption.
>> 
>> rdp->core_needs_qs *could* change after being read by the RCU core
>> kthread if it then gets preempted. Consider, with
>> CONFIG_RCU_STRICT_GRACE_PERIOD:
>> 
>>   rcuc/x                                   task_foo
>> 
>>     rcu_check_quiescent_state()
>>     `\
>>       rdp->core_needs_qs == true
>> 			     <PREEMPT>
>> 					   rcu_read_unlock()
>> 					   `\
>> 					     rcu_preempt_deferred_qs_irqrestore()
>> 					     `\
>> 					       rcu_report_qs_rdp()
>> 					       `\
>> 						 rdp->core_needs_qs := false;
>> 
>> This would let rcuc/x's rcu_check_quiescent_state() proceed further down to
>> rcu_report_qs_rdp(), but task_foo's earlier rcu_report_qs_rdp() invocation
>> would have already cleared the rdp grpmask from the rnp mask, so rcuc/x's
>> invocation would simply bail.
>> 
>> Since rcu_report_qs_rdp() can be safely invoked, even if rdp->core_needs_qs
>> changed, it appears safe to keep rcu_check_quiescent_state() preemptible.
>
> Another concern...
>
> During the preemption of rcu_check_quiescent_state() someone might report
> a quiescent state on behalf of CPU x (perhaps due to its having recently
> been idle) and then the RCU grace-period kthread might start running on
> CPU x, where it might initialize a new grace period in rcu_gp_init().
> It can then invoke __note_gp_changes(), also on CPU x.
>

(this is me "writing out loud" to make sure I can follow)

I take it the preemption of rcuc/x itself would count as a quiescent state,
if the current grace period started before rcuc/x's rcu_read_lock()
(rcuc/x would end up in ->blkd_tasks but wouldn't affect ->gp_tasks).

Then if we get something to run rcu_report_qs_rnp() with a mask spanning
CPU x - I think the GP kthread doing quiescent state forcing might fit the
bill - we'll let the GP kthread initialize a new GP.

> If preempted as shown above just after checking ->core_needs_qs, the
> ->cpu_no_qs.b.norm field will be set by the grace-period kthread, which
> will cause the rcu_check_quiescent_state() function's subsequent check
> of ->cpu_no_qs.b.norm to take an early exit.  So OK here.
>

Right

> On the other hand, if preempted just after the rcu_check_quiescent_state()
> function's check of ->cpu_no_qs.b.norm, the later invocation of
> rcu_report_qs_rdp() should take an early exit due to ->gp_seq mismatch.
> So OK here.
>

If, as described in your scenario above, the GP kthread has preempted
rcuc/x, initialized a new GP and has run __note_gp_changes() (on CPU x),
then wouldn't the rdp->gp_seq and rnp->gp_seq match when rcuc/x gets to run
again?

And because we would've then had a context switch between the GP kthread
and rcuc/x, we would've noted a quiescent state for CPU x, which would let
rcuc/x's rcu_report_qs_rdp() continue - or have I erred on the way there?

> However, this should be added to the commit log.  Might be a big commit
> log, but mass storage is cheap these days.  ;-)
>

No objections here!

> This needs a review of each and every manipulation of ->core_needs_qs
> and ->cpu_no_qs.b.norm.  For example, the preemptions will cause the
> scheduler to invoke RCU's context-switch hooks, which also mess with
> ->cpu_no_qs.b.norm.  I can get to that some time next week (or tomorrow,
> if things go better than expected), but it would be good for you (and
> others) to check as well.
>

I have some notes scribbled down regarding those that I need to go through
again, but that won't be before a week's time - I'll be away next week.

> Frederic should look this over, but I am taking a quick pass in the
> meantime.  Please see below.
>

Thanks!

>> @@ -210,6 +210,8 @@ struct rcu_data {
>>  	struct timer_list nocb_timer;	/* Enforce finite deferral. */
>>  	unsigned long nocb_gp_adv_time;	/* Last call_rcu() CB adv (jiffies). */
>>  
>> +	local_lock_t nocb_local_lock;
>
> Should this go near the beginning of the structure, given that code
> paths taking this lock tend to access ->cpu_no_qs, ->core_needs_qs,
> and so on?
>
> Given that it is used to protect core processing (not just offloaded
> callbacks), might ->core_local_lock be a better name?
>

It would still have to be in an #ifdef CONFIG_RCU_NOCB_CPU region IMO, as
it only exists to protect access of the offloading state - hence the name,
but I'm not particularly attached to it.

> Please keep in mind that you can build kernels that offload callbacks
> but that still use softirq for RCU core processing.  And vice versa,
> that is, kernels that use rcuc kthreads but do not offload callbacks.
>

AFAICT the problem only stands if core processing becomes preemptible,
which is only the case under PREEMPT_RT (at least for now), and that implies
core processing done purely via kthreads.

>> +
>>  	/* The following fields are used by call_rcu, hence own cacheline. */
>>  	raw_spinlock_t nocb_bypass_lock ____cacheline_internodealigned_in_smp;
>>  	struct rcu_cblist nocb_bypass;	/* Lock-contention-bypass CB list. */
>> @@ -445,6 +447,8 @@ static void rcu_nocb_unlock(struct rcu_data *rdp);
>>  static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
>>  				       unsigned long flags);
>>  static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp);
>> +static void rcu_nocb_local_lock(struct rcu_data *rdp);
>> +static void rcu_nocb_local_unlock(struct rcu_data *rdp);
>>  #ifdef CONFIG_RCU_NOCB_CPU
>>  static void __init rcu_organize_nocb_kthreads(void);
>>  #define rcu_nocb_lock_irqsave(rdp, flags)				\
>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>> index 0ff5e4fb933e..11c4ff00afde 100644
>> --- a/kernel/rcu/tree_plugin.h
>> +++ b/kernel/rcu/tree_plugin.h
>> @@ -21,6 +21,17 @@ static inline int rcu_lockdep_is_held_nocb(struct rcu_data *rdp)
>>  	return lockdep_is_held(&rdp->nocb_lock);
>>  }
>>  
>> +static inline int rcu_lockdep_is_held_nocb_local(struct rcu_data *rdp)
>> +{
>> +	return lockdep_is_held(
>> +#ifdef CONFIG_PREEMPT_RT
>> +		&rdp->nocb_local_lock.lock
>> +#else
>> +		&rdp->nocb_local_lock
>> +#endif
>
> It would be good if this was abstracted.  Or is this the only place
> in the kernel that needs this #ifdef?  Maybe a lockdep_is_held_rt()
> or lockdep_is_held_local(), as the case may be?
>

It does look like the first/only place that tries to access a local_lock's
dep_map regardless of CONFIG_PREEMPT_RT. Looking at it some more, I'm not
sure if it's really useful: !PREEMPT_RT local_locks disable preemption, so
the preemption check in rcu_rdp_is_offloaded() will short circuit the
lockdep check when !PREEMPT_RT...
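
For reference, the !PREEMPT_RT mapping is roughly the following (simplified
from include/linux/local_lock_internal.h, lockdep annotations elided), which
is why the preemptible() check already covers that configuration:

#define __local_lock_sketch(lock)		\
	do {					\
		preempt_disable();		\
		/* lockdep acquire elided */	\
	} while (0)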

One thing I should perhaps point out - the local lock is only really needed
for PREEMPT_RT; for !PREEMPT_RT it will just disable preemption, and it's
only used in places that *already* disabled preemption (when !PREEMPT_RT).
I initially gated the lock under CONFIG_PREEMPT_RT, but changed that as I
found it reduced the ifdeffery.

>> +	);
>> +}
>> +
>>  static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp)
>>  {
>>  	/* Race on early boot between thread creation and assignment */
>> @@ -38,7 +49,10 @@ static inline int rcu_lockdep_is_held_nocb(struct rcu_data *rdp)
>>  {
>>  	return 0;
>>  }
>> -
>> +static inline int rcu_lockdep_is_held_nocb_local(struct rcu_data *rdp)
>> +{
>> +	return 0;
>
> This is backwards of normal lockdep practice, which defaults to locks
> always held.  And that will be what happens once lockdep has detected
> its first deadlock, correct?  At which point, this function and its
> earlier instance will be in conflict.
>
> Or is there some subtle reason why this conflict would be OK?
>

This follows the !CONFIG_RCU_NOCB definition of rcu_lockdep_is_held_nocb(),
which looked fine to me.

Actually with the way I wrote rcu_local_offload_access_safe(), we don't
even access that function when !CONFIG_RCU_NOCB...

>> +}
>>  static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp)
>>  {
>>  	return false;
>> @@ -46,23 +60,44 @@ static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp)
>>  
>>  #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
>>  
>> +/*
>> + * Is a local read of the rdp's offloaded state safe and stable?
>> + * See rcu_nocb_local_lock() & family.
>> + */
>> +static inline bool rcu_local_offload_access_safe(struct rcu_data *rdp)
>> +{
>> +	if (!preemptible())
>> +		return true;
>> +
>> +	if (!migratable()) {
>> +		if (!IS_ENABLED(CONFIG_RCU_NOCB))
>
> Do we also need to consult the use_softirq module parameter that controls
> whether or not there are rcuc kthreads?
>
> Actually, if !IS_ENABLED(CONFIG_RCU_NOCB) then rcu_rdp_is_offloaded()
> can simply return false without bothering with the RCU_LOCKDEP_WARN().
> Might be worth getting that out of the way of the RCU_LOCKDEP_WARN()
> condition.  ;-)
>

I _assumed_ the check was there even for !CONFIG_RCU_NOCB to provide a wider
test coverage of the calling conditions.

If that RCU_LOCKDEP_WARN() becomes conditional on CONFIG_RCU_NOCB then
yes, we could clean that up.

>> +			return true;
>> +
>> +		return rcu_lockdep_is_held_nocb_local(rdp);
>> +	}
>> +
>> +	return false;
>> +}
>> +
Sebastian Andrzej Siewior Aug. 17, 2021, 3:36 p.m. UTC | #3
On 2021-08-11 21:13:53 [+0100], Valentin Schneider wrote:
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 0ff5e4fb933e..11c4ff00afde 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -21,6 +21,17 @@ static inline int rcu_lockdep_is_held_nocb(struct rcu_data *rdp)
>  	return lockdep_is_held(&rdp->nocb_lock);
>  }
>  
> +static inline int rcu_lockdep_is_held_nocb_local(struct rcu_data *rdp)
> +{
> +	return lockdep_is_held(
> +#ifdef CONFIG_PREEMPT_RT
> +		&rdp->nocb_local_lock.lock
> +#else
> +		&rdp->nocb_local_lock
> +#endif
> +	);
> +}

Now that I see it and Paul asked for it, please just use !RT version.
	return lockdep_is_held(&rdp->nocb_local_lock);

and RT will work, too.

>  static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp)
>  {
>  	/* Race on early boot between thread creation and assignment */
> @@ -1629,6 +1664,22 @@ static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
>  	}
>  }
>  
> +/*
> + * The invocation of rcu_core() within the RCU core kthreads remains preemptible
> + * under PREEMPT_RT, thus the offload state of a CPU could change while
> + * said kthreads are preempted. Prevent this from happening by protecting the
> + * offload state with a local_lock().
> + */
> +static void rcu_nocb_local_lock(struct rcu_data *rdp)
> +{
> +	local_lock(&rcu_data.nocb_local_lock);
> +}
> +
> +static void rcu_nocb_local_unlock(struct rcu_data *rdp)
> +{
> +	local_unlock(&rcu_data.nocb_local_lock);
> +}
> +
Do you need to pass rdp given that it is not used?

>  /* Lockdep check that ->cblist may be safely accessed. */
>  static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp)
>  {

Sebastian
Valentin Schneider Aug. 22, 2021, 6:15 p.m. UTC | #4
On 17/08/21 17:36, Sebastian Andrzej Siewior wrote:
>> +static inline int rcu_lockdep_is_held_nocb_local(struct rcu_data *rdp)
>> +{
>> +	return lockdep_is_held(
>> +#ifdef CONFIG_PREEMPT_RT
>> +		&rdp->nocb_local_lock.lock
>> +#else
>> +		&rdp->nocb_local_lock
>> +#endif
>> +	);
>> +}
>
> Now that I see it and Paul asked for it, please just use !RT version.
>       return lockdep_is_held(&rdp->nocb_local_lock);
>
> and RT will work, too.
>

The above was required due to the (previous) definition of local_lock_t:

  typedef struct {
          spinlock_t		lock;
  } local_lock_t;

On -rt11 I see this is now:

  typedef spinlock_t local_lock_t;

which indeed means the iffdefery can (actually it *has* to) go.

>>  static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp)
>>  {
>>      /* Race on early boot between thread creation and assignment */
>> @@ -1629,6 +1664,22 @@ static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
>>      }
>>  }
>>
>> +/*
>> + * The invocation of rcu_core() within the RCU core kthreads remains preemptible
>> + * under PREEMPT_RT, thus the offload state of a CPU could change while
>> + * said kthreads are preempted. Prevent this from happening by protecting the
>> + * offload state with a local_lock().
>> + */
>> +static void rcu_nocb_local_lock(struct rcu_data *rdp)
>> +{
>> +	local_lock(&rcu_data.nocb_local_lock);
>> +}
>> +
>> +static void rcu_nocb_local_unlock(struct rcu_data *rdp)
>> +{
>> +	local_unlock(&rcu_data.nocb_local_lock);
>> +}
>> +
> Do you need to pass rdp given that it is not used?
>

Not anymore, you're right.
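
That is, the helpers would shrink to something like this (sketch of the
agreed direction, not a posted patch):

static void rcu_nocb_local_lock(void)
{
	local_lock(&rcu_data.nocb_local_lock);
}

static void rcu_nocb_local_unlock(void)
{
	local_unlock(&rcu_data.nocb_local_lock);
}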

>>  /* Lockdep check that ->cblist may be safely accessed. */
>>  static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp)
>>  {
>
> Sebastian
Thomas Gleixner Sept. 21, 2021, 2:05 p.m. UTC | #5
Valentin,

On Wed, Aug 11 2021 at 21:13, Valentin Schneider wrote:
> Running v5.13-rt1 on my arm64 Juno board triggers:
>
> [    0.156302] =============================
> [    0.160416] WARNING: suspicious RCU usage
> [    0.172409] kernel/rcu/tree_plugin.h:69 Unsafe read of RCU_NOCB offloaded state!
> [    0.260328] rcu_rdp_is_offloaded (kernel/rcu/tree_plugin.h:69 kernel/rcu/tree_plugin.h:58)
> [    0.264537] rcu_core (kernel/rcu/tree.c:2332 kernel/rcu/tree.c:2398 kernel/rcu/tree.c:2777)
> [    0.267786] rcu_cpu_kthread (./include/linux/bottom_half.h:32 kernel/rcu/tree.c:2876)
>
> In this case, this is the RCU core kthread accessing the local CPU's
> rdp. Before that, rcu_cpu_kthread() invokes local_bh_disable().
>
> Under !CONFIG_PREEMPT_RT (and rcutree.use_softirq=0), this ends up
> incrementing the preempt_count, which satisfies the "local non-preemptible
> read" of rcu_rdp_is_offloaded().
>
> Under CONFIG_PREEMPT_RT however, this becomes
>
>   local_lock(&softirq_ctrl.lock)
>
> which, under the same config, is migrate_disable() + rt_spin_lock(). As
> pointed out by Frederic, this is not sufficient to safely access an rdp's
> offload state, as the RCU core kthread can be preempted by a kworker
> executing rcu_nocb_rdp_offload() [1].
>
> Introduce a local_lock to serialize an rdp's offload state while the rdp's
> associated core kthread is executing rcu_core().

Yes, sure. But I don't think that local_lock is required at all.

The point is that the two places where this actually matters invoke
rcu_rdp_is_offloaded() just at the top of the function, outside of the
already existing protection sections. Moving it into the sections which
already provide the required protections makes it just work for both RT
and !RT.

Thanks,

        tglx
---

--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2278,13 +2278,13 @@ rcu_report_qs_rdp(struct rcu_data *rdp)
 {
 	unsigned long flags;
 	unsigned long mask;
-	bool needwake = false;
-	const bool offloaded = rcu_rdp_is_offloaded(rdp);
+	bool offloaded, needwake = false;
 	struct rcu_node *rnp;
 
 	WARN_ON_ONCE(rdp->cpu != smp_processor_id());
 	rnp = rdp->mynode;
 	raw_spin_lock_irqsave_rcu_node(rnp, flags);
+	offloaded = rcu_rdp_is_offloaded(rdp);
 	if (rdp->cpu_no_qs.b.norm || rdp->gp_seq != rnp->gp_seq ||
 	    rdp->gpwrap) {
 
@@ -2446,7 +2446,7 @@ static void rcu_do_batch(struct rcu_data
 	int div;
 	bool __maybe_unused empty;
 	unsigned long flags;
-	const bool offloaded = rcu_rdp_is_offloaded(rdp);
+	bool offloaded;
 	struct rcu_head *rhp;
 	struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
 	long bl, count = 0;
@@ -2472,6 +2472,7 @@ static void rcu_do_batch(struct rcu_data
 	rcu_nocb_lock(rdp);
 	WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
 	pending = rcu_segcblist_n_cbs(&rdp->cblist);
+	offloaded = rcu_rdp_is_offloaded(rdp);
 	div = READ_ONCE(rcu_divisor);
 	div = div < 0 ? 7 : div > sizeof(long) * 8 - 2 ? sizeof(long) * 8 - 2 : div;
 	bl = max(rdp->blimit, pending >> div);
Frederic Weisbecker Sept. 21, 2021, 11:36 p.m. UTC | #6
On Tue, Sep 21, 2021 at 11:12:50PM +0200, Thomas Gleixner wrote:
> Valentin reported warnings about suspicious RCU usage on RT kernels. Those
> happen when offloading of RCU callbacks is enabled:
> 
>   WARNING: suspicious RCU usage
>   5.13.0-rt1 #20 Not tainted
>   -----------------------------
>   kernel/rcu/tree_plugin.h:69 Unsafe read of RCU_NOCB offloaded state!
> 
>   rcu_rdp_is_offloaded (kernel/rcu/tree_plugin.h:69 kernel/rcu/tree_plugin.h:58)
>   rcu_core (kernel/rcu/tree.c:2332 kernel/rcu/tree.c:2398 kernel/rcu/tree.c:2777)
>   rcu_cpu_kthread (./include/linux/bottom_half.h:32 kernel/rcu/tree.c:2876)
> 
> The reason is that rcu_rdp_is_offloaded() is invoked without one of the
> required protections on RT enabled kernels because local_bh_disable() does
> not disable preemption on RT.
> 
> Valentin proposed to add a local lock to the code in question, but that's
> suboptimal in several aspects:
> 
>   1) local locks add extra code to !RT kernels for no value.
> 
>   2) All possible callsites have to be audited and amended when affected,
>      possibly at an outer function level due to lock nesting issues.
> 
>   3) As the local lock has to be taken at the outer functions it's required
>      to release and reacquire them in the inner code sections which might
>      voluntarily schedule, e.g. rcu_do_batch().
> 
> Both callsites of rcu_rdp_is_offloaded() which trigger this check invoke
> rcu_rdp_is_offloaded() in the variable declaration section right at the top
> of the functions. But the actual usage of the result is either within a
> section which provides the required protections or after such a section.
> 
> So the obvious solution is to move the invocation into the code sections
> which provide the proper protections, which solves the problem for RT and
> does not have any impact on !RT kernels.

You also need to consider rcu_segcblist_completely_offloaded(). There
are two users:

1) The first chunk using it in rcu_core() checks if there is a need to
accelerate the callback and that can happen concurrently with nocb
manipulations on the cblist. Concurrent (de-)offloading could mess
up with the locking state but here is what we can do:

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index bce848e50512..3e56a1a4af03 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2728,9 +2728,10 @@ static __latent_entropy void rcu_core(void)
 
 	/* No grace period and unregistered callbacks? */
 	if (!rcu_gp_in_progress() &&
-	    rcu_segcblist_is_enabled(&rdp->cblist) && do_batch) {
+	    rcu_segcblist_is_enabled(&rdp->cblist)) {
 		rcu_nocb_lock_irqsave(rdp, flags);
-		if (!rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
+		if (!rcu_segcblist_completely_offloaded(&rdp->cblist) &&
+		    !rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
 			rcu_accelerate_cbs_unlocked(rnp, rdp);
 		rcu_nocb_unlock_irqrestore(rdp, flags);
 	}
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 305cf6aeb408..64d615be3346 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -449,10 +449,9 @@ static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp);
 static void __init rcu_organize_nocb_kthreads(void);
 #define rcu_nocb_lock_irqsave(rdp, flags)				\
 do {									\
+	local_irq_save(flags);						\
 	if (!rcu_segcblist_is_offloaded(&(rdp)->cblist))		\
-		local_irq_save(flags);					\
-	else								\
-		raw_spin_lock_irqsave(&(rdp)->nocb_lock, (flags));	\
+		raw_spin_lock(&(rdp)->nocb_lock);			\
 } while (0)
 #else /* #ifdef CONFIG_RCU_NOCB_CPU */
 #define rcu_nocb_lock_irqsave(rdp, flags) local_irq_save(flags)


Doing the local_irq_save() before checking that the segcblist is offloaded
protects that state from being changed (provided we lock the local rdp). Then we
can safely manipulate the cblist, whether locked or unlocked.
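
For symmetry, the unlock side could then look like the sketch below
(illustrative only, not part of the posted hunk); since IRQs stay disabled
across the section, the offloaded state cannot change locally in between:

static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
				       unsigned long flags)
{
	/* Offloaded state is stable here: IRQs have been off since the lock. */
	if (rcu_segcblist_is_offloaded(&rdp->cblist))
		raw_spin_unlock(&rdp->nocb_lock);
	local_irq_restore(flags);
}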

2) The actual call to rcu_do_batch(). If we are preempted between
rcu_segcblist_completely_offloaded() and rcu_do_batch() with a deoffload in
the middle, we miss the callback invocation. Invoking rcu_core by the end of
deoffloading process should solve that.

> 
> Reported-by: Valentin Schneider <valentin.schneider@arm.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/rcu/tree.c |    7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2278,13 +2278,13 @@ rcu_report_qs_rdp(struct rcu_data *rdp)
>  {
>  	unsigned long flags;
>  	unsigned long mask;
> -	bool needwake = false;
> -	const bool offloaded = rcu_rdp_is_offloaded(rdp);
> +	bool offloaded, needwake = false;
>  	struct rcu_node *rnp;
>  
>  	WARN_ON_ONCE(rdp->cpu != smp_processor_id());
>  	rnp = rdp->mynode;
>  	raw_spin_lock_irqsave_rcu_node(rnp, flags);
> +	offloaded = rcu_rdp_is_offloaded(rdp);
>  	if (rdp->cpu_no_qs.b.norm || rdp->gp_seq != rnp->gp_seq ||
>  	    rdp->gpwrap) {

BTW Paul, if we happen to switch to non-NOCB (deoffload) some time after
rcu_report_qs_rdp(), it's possible that the rcu_accelerate_cbs()
that was supposed to be handled by nocb kthreads on behalf of
rcu_core() -> rcu_report_qs_rdp() would not happen. At least not until
we invoke rcu_core() again. Not sure how much harm that could cause.

> @@ -2446,7 +2446,7 @@ static void rcu_do_batch(struct rcu_data
>  	int div;
>  	bool __maybe_unused empty;
>  	unsigned long flags;
> -	const bool offloaded = rcu_rdp_is_offloaded(rdp);
> +	bool offloaded;
>  	struct rcu_head *rhp;
>  	struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
>  	long bl, count = 0;
> @@ -2472,6 +2472,7 @@ static void rcu_do_batch(struct rcu_data
>  	rcu_nocb_lock(rdp);
>  	WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
>  	pending = rcu_segcblist_n_cbs(&rdp->cblist);
> +	offloaded = rcu_rdp_is_offloaded(rdp);

offloaded is also checked later in rcu_do_batch(), after irqrestore. In
fact that should even become a rcu_segcblist_completely_offloaded() check
there because if there are still pending callbacks while we are de-offloading,
rcu_core() should be invoked. Hmm but that might be a remote rcu_core...

Anyway I guess we could live with some of those races with invoking rcu core on the
target after deoffloading.

I guess I should cook a series to handle all these issues one by one, then
probably we can live without a local_lock().

Thanks.



>  	div = READ_ONCE(rcu_divisor);
>  	div = div < 0 ? 7 : div > sizeof(long) * 8 - 2 ? sizeof(long) * 8 - 2 : div;
>  	bl = max(rdp->blimit, pending >> div);
Frederic Weisbecker Sept. 21, 2021, 11:45 p.m. UTC | #7
On Tue, Sep 21, 2021 at 11:12:50PM +0200, Thomas Gleixner wrote:
> Valentin reported warnings about suspicious RCU usage on RT kernels. Those
> happen when offloading of RCU callbacks is enabled:
> 
>   WARNING: suspicious RCU usage
>   5.13.0-rt1 #20 Not tainted
>   -----------------------------
>   kernel/rcu/tree_plugin.h:69 Unsafe read of RCU_NOCB offloaded state!
> 
>   rcu_rdp_is_offloaded (kernel/rcu/tree_plugin.h:69 kernel/rcu/tree_plugin.h:58)
>   rcu_core (kernel/rcu/tree.c:2332 kernel/rcu/tree.c:2398 kernel/rcu/tree.c:2777)
>   rcu_cpu_kthread (./include/linux/bottom_half.h:32 kernel/rcu/tree.c:2876)
> 
> The reason is that rcu_rdp_is_offloaded() is invoked without one of the
> required protections on RT enabled kernels because local_bh_disable() does
> not disable preemption on RT.
> 
> Valentin proposed to add a local lock to the code in question, but that's
> suboptimal in several aspects:
> 
>   1) local locks add extra code to !RT kernels for no value.
> 
>   2) All possible callsites have to be audited and amended when affected,
>      possibly at an outer function level due to lock nesting issues.
> 
>   3) As the local lock has to be taken at the outer functions it's required
>      to release and reacquire them in the inner code sections which might
>      voluntarily schedule, e.g. rcu_do_batch().
> 
> Both callsites of rcu_rdp_is_offloaded() which trigger this check invoke
> rcu_rdp_is_offloaded() in the variable declaration section right at the top
> of the functions. But the actual usage of the result is either within a
> section which provides the required protections or after such a section.
> 
> So the obvious solution is to move the invocation into the code sections
> which provide the proper protections, which solves the problem for RT and
> does not have any impact on !RT kernels.

Also while at it, I'm asking again: traditionally softirqs could assume that
manipulating a local state was safe against !irq_count() code fiddling with
the same state on the same CPU.

Now with preemptible softirqs, that assumption can be broken anytime. RCU was
fortunate enough to have a warning for that. But who knows how many issues like
this are lurking?

Thanks.
Paul E. McKenney Sept. 22, 2021, 2:18 a.m. UTC | #8
On Wed, Sep 22, 2021 at 01:36:27AM +0200, Frederic Weisbecker wrote:
> On Tue, Sep 21, 2021 at 11:12:50PM +0200, Thomas Gleixner wrote:
> > Valentin reported warnings about suspicious RCU usage on RT kernels. Those
> > happen when offloading of RCU callbacks is enabled:
> > 
> >   WARNING: suspicious RCU usage
> >   5.13.0-rt1 #20 Not tainted
> >   -----------------------------
> >   kernel/rcu/tree_plugin.h:69 Unsafe read of RCU_NOCB offloaded state!
> > 
> >   rcu_rdp_is_offloaded (kernel/rcu/tree_plugin.h:69 kernel/rcu/tree_plugin.h:58)
> >   rcu_core (kernel/rcu/tree.c:2332 kernel/rcu/tree.c:2398 kernel/rcu/tree.c:2777)
> >   rcu_cpu_kthread (./include/linux/bottom_half.h:32 kernel/rcu/tree.c:2876)
> > 
> > The reason is that rcu_rdp_is_offloaded() is invoked without one of the
> > required protections on RT enabled kernels because local_bh_disable() does
> > not disable preemption on RT.
> > 
> > Valentin proposed to add a local lock to the code in question, but that's
> > suboptimal in several aspects:
> > 
> >   1) local locks add extra code to !RT kernels for no value.
> > 
> >   2) All possible callsites have to be audited and amended when affected,
> >      possibly at an outer function level due to lock nesting issues.
> > 
> >   3) As the local lock has to be taken at the outer functions it's required
> >      to release and reacquire them in the inner code sections which might
> >      voluntarily schedule, e.g. rcu_do_batch().
> > 
> > Both callsites of rcu_rdp_is_offloaded() which trigger this check invoke
> > rcu_rdp_is_offloaded() in the variable declaration section right at the top
> > of the functions. But the actual usage of the result is either within a
> > section which provides the required protections or after such a section.
> > 
> > So the obvious solution is to move the invocation into the code sections
> > which provide the proper protections, which solves the problem for RT and
> > does not have any impact on !RT kernels.
> 
> You also need to consider rcu_segcblist_completely_offloaded(). There
> are two users:
> 
> 1) The first chunk using it in rcu_core() checks if there is a need to
> accelerate the callback and that can happen concurrently with nocb
> manipulations on the cblist. Concurrent (de-)offloading could mess
> up with the locking state but here is what we can do:
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index bce848e50512..3e56a1a4af03 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2728,9 +2728,10 @@ static __latent_entropy void rcu_core(void)
>  
>  	/* No grace period and unregistered callbacks? */
>  	if (!rcu_gp_in_progress() &&
> -	    rcu_segcblist_is_enabled(&rdp->cblist) && do_batch) {
> +	    rcu_segcblist_is_enabled(&rdp->cblist)) {
>  		rcu_nocb_lock_irqsave(rdp, flags);
> -		if (!rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
> +		if (!rcu_segcblist_completely_offloaded(&rdp->cblist) &&
> +		    !rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
>  			rcu_accelerate_cbs_unlocked(rnp, rdp);
>  		rcu_nocb_unlock_irqrestore(rdp, flags);
>  	}
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 305cf6aeb408..64d615be3346 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -449,10 +449,9 @@ static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp);
>  static void __init rcu_organize_nocb_kthreads(void);
>  #define rcu_nocb_lock_irqsave(rdp, flags)				\
>  do {									\
> +	local_irq_save(flags);						\
>  	if (!rcu_segcblist_is_offloaded(&(rdp)->cblist))		\
> -		local_irq_save(flags);					\
> -	else								\
> -		raw_spin_lock_irqsave(&(rdp)->nocb_lock, (flags));	\
> +		raw_spin_lock(&(rdp)->nocb_lock);			\
>  } while (0)
>  #else /* #ifdef CONFIG_RCU_NOCB_CPU */
>  #define rcu_nocb_lock_irqsave(rdp, flags) local_irq_save(flags)
> 
> 
> > Doing the local_irq_save() before checking that the segcblist is offloaded
> > protects that state from being changed (provided we lock the local rdp). Then we
> > can safely manipulate the cblist, whether locked or unlocked.
> 
> 2) The actual call to rcu_do_batch(). If we are preempted between
> rcu_segcblist_completely_offloaded() and rcu_do_batch() with a deoffload in
> the middle, we miss the callback invocation. Invoking rcu_core by the end of
> deoffloading process should solve that.

Maybe invoke rcu_core() at that point?  My concern is that there might
be an extended time between the missed rcu_do_batch() and the end of
the deoffloading process.

> > Reported-by: Valentin Schneider <valentin.schneider@arm.com>
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > ---
> >  kernel/rcu/tree.c |    7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -2278,13 +2278,13 @@ rcu_report_qs_rdp(struct rcu_data *rdp)
> >  {
> >  	unsigned long flags;
> >  	unsigned long mask;
> > -	bool needwake = false;
> > -	const bool offloaded = rcu_rdp_is_offloaded(rdp);
> > +	bool offloaded, needwake = false;
> >  	struct rcu_node *rnp;
> >  
> >  	WARN_ON_ONCE(rdp->cpu != smp_processor_id());
> >  	rnp = rdp->mynode;
> >  	raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > +	offloaded = rcu_rdp_is_offloaded(rdp);
> >  	if (rdp->cpu_no_qs.b.norm || rdp->gp_seq != rnp->gp_seq ||
> >  	    rdp->gpwrap) {
> 
> BTW Paul, if we happen to switch to non-NOCB (deoffload) some time after
> rcu_report_qs_rdp(), it's possible that the rcu_accelerate_cbs()
> that was supposed to be handled by nocb kthreads on behalf of
> rcu_core() -> rcu_report_qs_rdp() would not happen. At least not until
> we invoke rcu_core() again. Not sure how much harm that could cause.

Again, can we just invoke rcu_core() as soon as this is noticed?

> > @@ -2446,7 +2446,7 @@ static void rcu_do_batch(struct rcu_data
> >  	int div;
> >  	bool __maybe_unused empty;
> >  	unsigned long flags;
> > -	const bool offloaded = rcu_rdp_is_offloaded(rdp);
> > +	bool offloaded;
> >  	struct rcu_head *rhp;
> >  	struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
> >  	long bl, count = 0;
> > @@ -2472,6 +2472,7 @@ static void rcu_do_batch(struct rcu_data
> >  	rcu_nocb_lock(rdp);
> >  	WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
> >  	pending = rcu_segcblist_n_cbs(&rdp->cblist);
> > +	offloaded = rcu_rdp_is_offloaded(rdp);
> 
> offloaded is also checked later in rcu_do_batch(), after irqrestore. In
> fact that should even become a rcu_segcblist_completely_offloaded() check
> there because if there are still pending callbacks while we are de-offloading,
> rcu_core() should be invoked. Hmm but that might be a remote rcu_core...
> 
> Anyway I guess we could live with some of those races with invoking rcu core on the
> target after deoffloading.
> 
> I guess I should cook a series to handle all these issues one by one, then
> probably we can live without a local_lock().

And thank you very much for looking this over!  Not simple stuff.  ;-)

							Thanx, Paul

> Thanks.
> 
> 
> 
> >  	div = READ_ONCE(rcu_divisor);
> >  	div = div < 0 ? 7 : div > sizeof(long) * 8 - 2 ? sizeof(long) * 8 - 2 : div;
> >  	bl = max(rdp->blimit, pending >> div);
Sebastian Andrzej Siewior Sept. 22, 2021, 6:32 a.m. UTC | #9
On 2021-09-22 01:45:18 [+0200], Frederic Weisbecker wrote:
> 
> Also while at it, I'm asking again: traditionally softirqs could assume that
> manipulating a local state was safe against !irq_count() code fiddling with
> the same state on the same CPU.
> 
> Now with preemptible softirqs, that assumption can be broken anytime. RCU was
> fortunate enough to have a warning for that. But who knows how many issues like
> this are lurking?

If "local state" is modified then it is safe as long as it is modified
within a local_bh_disable() section. And we are in this section while
invoking a forced-threaded interrupt. The special part about RCU is
that it is used in_irq() as part of core-code.

> Thanks.

Sebastian
Frederic Weisbecker Sept. 22, 2021, 11:10 a.m. UTC | #10
On Wed, Sep 22, 2021 at 08:32:08AM +0200, Sebastian Andrzej Siewior wrote:
> On 2021-09-22 01:45:18 [+0200], Frederic Weisbecker wrote:
> > 
> > Also while at it, I'm asking again: traditionally softirqs could assume that
> > manipulating a local state was safe against !irq_count() code fiddling with
> > the same state on the same CPU.
> > 
> > Now with preemptible softirqs, that assumption can be broken anytime. RCU was
> > fortunate enough to have a warning for that. But who knows how many issues like
> > this are lurking?
> 
> If "local state" is modified then it is safe as long as it is modified
> within a local_bh_disable() section. And we are in this section while
> invoking a forced-threaded interrupt. The special part about RCU is
> that it is used in_irq() as part of core-code.

But local_bh_disable() was meant to protect against interrupting softirqs,
not the other way around (softirqs being preempted by other tasks). The latter
semantic is new and nobody had that in mind until softirqs were made
preemptible.

For example:

                             CPU 0
          -----------------------------------------------
          SOFTIRQ                            RANDOM TASK
          ------                             -----------
          int *X = &per_cpu(CPUX, 0)         int *X = &per_cpu(CPUX, 0)
          int A, B;                          WRITE_ONCE(*X, 0);
                                             WRITE_ONCE(*X, 1);
          A = READ_ONCE(*X);
          B = READ_ONCE(*X);


We used to have the guarantee that A == B. That's not true anymore. Now
some new explicit local_bh_disable() should be carefully placed on RANDOM_TASK
where it wasn't necessary before. RCU is not that special in this regard.
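
Concretely, the fix on the RANDOM TASK side would be something like the
sketch below (names invented for the example above):

static DEFINE_PER_CPU(int, CPUX);

static void random_task_update(void)
{
	/*
	 * On RT this serializes against the (preemptible) softirq via the
	 * per-CPU BH lock; on !RT it keeps the softirq from running in
	 * between, so the softirq reader observes A == B again.
	 */
	local_bh_disable();
	__this_cpu_write(CPUX, 0);
	__this_cpu_write(CPUX, 1);
	local_bh_enable();
}
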
Sebastian Andrzej Siewior Sept. 22, 2021, 11:27 a.m. UTC | #11
On 2021-09-22 13:10:12 [+0200], Frederic Weisbecker wrote:
> On Wed, Sep 22, 2021 at 08:32:08AM +0200, Sebastian Andrzej Siewior wrote:
> > On 2021-09-22 01:45:18 [+0200], Frederic Weisbecker wrote:
> > > 
> > > Also while at it, I'm asking again: traditionally softirqs could assume that
> > > manipulating a local state was safe against !irq_count() code fiddling with
> > > the same state on the same CPU.
> > > 
> > > Now with preemptible softirqs, that assumption can be broken anytime. RCU was
> > > fortunate enough to have a warning for that. But who knows how many issues like
> > > this are lurking?
> > 
> > If "local state" is modified then it is safe as long as it is modified
> > within a local_bh_disable() section. And we are in this section while
> > invoking a forced-threaded interrupt. The special part about RCU is
> > that it is used in_irq() as part of core-code.
> 
> But local_bh_disable() was meant to protect against interrupting softirqs,
> not the other way around (softirqs being preempted by other tasks). The latter
> semantic is new and nobody had that in mind until softirqs were made
> preemptible.
> 
> For example:
> 
>                              CPU 0
>           -----------------------------------------------
>           SOFTIRQ                            RANDOM TASK
>           ------                             -----------
>           int *X = &per_cpu(CPUX, 0)         int *X = &per_cpu(CPUX, 0)
>           int A, B;                          WRITE_ONCE(*X, 0);
>                                              WRITE_ONCE(*X, 1);
>           A = READ_ONCE(*X);
>           B = READ_ONCE(*X);
> 
> 
> We used to have the guarantee that A == B. That's not true anymore. Now
> some new explicit local_bh_disable() should be carefully placed on RANDOM_TASK
> where it wasn't necessary before. RCU is not that special in this regard.

The part with rcutree.use_softirq=0 on RT does not make it any better,
right?
So you rely on some implicit behaviour which breaks with RT such as:

                              CPU 0
           -----------------------------------------------
           RANDOM TASK-A                      RANDOM TASK-B
           ------                             -----------
           int *X = &per_cpu(CPUX, 0)         int *X = &per_cpu(CPUX, 0)
           int A, B;                          
                                              spin_lock(&D);
           spin_lock(&C);
                                              WRITE_ONCE(*X, 0);
           A = READ_ONCE(*X);
                                              WRITE_ONCE(*X, 1);
           B = READ_ONCE(*X);

while spinlock C and D are just random locks not related to CPUX but it
just happens that they are held at that time. So for !RT you guarantee
that A == B while it is not the case on RT.

Sebastian
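
For reference, one way to make that implicit !RT guarantee explicit on both
configurations is a local_lock(); a rough sketch with made-up names (cpux,
cpux_lock), not taken from the thread:

static DEFINE_PER_CPU(int, cpux);
static DEFINE_PER_CPU(local_lock_t, cpux_lock) = INIT_LOCAL_LOCK(cpux_lock);

/* TASK-A: the two reads now form one critical section. */
static void task_a(void)
{
	int a, b;

	local_lock(&cpux_lock);	/* preempt_disable() on !RT, per-CPU sleeping lock on RT */
	a = READ_ONCE(*this_cpu_ptr(&cpux));
	b = READ_ONCE(*this_cpu_ptr(&cpux));
	local_unlock(&cpux_lock);
	WARN_ON_ONCE(a != b);	/* holds on both !RT and RT */
}

/* TASK-B: the writes are serialized against TASK-A on the same CPU. */
static void task_b(void)
{
	local_lock(&cpux_lock);
	WRITE_ONCE(*this_cpu_ptr(&cpux), 0);
	WRITE_ONCE(*this_cpu_ptr(&cpux), 1);
	local_unlock(&cpux_lock);
}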
Frederic Weisbecker Sept. 22, 2021, 11:31 a.m. UTC | #12
On Tue, Sep 21, 2021 at 07:18:37PM -0700, Paul E. McKenney wrote:
> On Wed, Sep 22, 2021 at 01:36:27AM +0200, Frederic Weisbecker wrote:
> > Doing the local_irq_save() before checking that the segcblist is offloaded
> > protects that state from being changed (provided we lock the local rdp). Then we
> > can safely manipulate the cblist, whether locked or unlocked.
> > 
> > 2) The actual call to rcu_do_batch(). If we are preempted between
> > rcu_segcblist_completely_offloaded() and rcu_do_batch() with a deoffload in
> > the middle, we miss the callback invocation. Invoking rcu_core() at the end of
> > the deoffloading process should solve that.
> 
> Maybe invoke rcu_core() at that point?  My concern is that there might
> be an extended time between the missed rcu_do_batch() and the end of
> the deoffloading process.

Agreed!

> 
> > > Reported-by: Valentin Schneider <valentin.schneider@arm.com>
> > > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > > ---
> > >  kernel/rcu/tree.c |    7 ++++---
> > >  1 file changed, 4 insertions(+), 3 deletions(-)
> > > 
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -2278,13 +2278,13 @@ rcu_report_qs_rdp(struct rcu_data *rdp)
> > >  {
> > >  	unsigned long flags;
> > >  	unsigned long mask;
> > > -	bool needwake = false;
> > > -	const bool offloaded = rcu_rdp_is_offloaded(rdp);
> > > +	bool offloaded, needwake = false;
> > >  	struct rcu_node *rnp;
> > >  
> > >  	WARN_ON_ONCE(rdp->cpu != smp_processor_id());
> > >  	rnp = rdp->mynode;
> > >  	raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > > +	offloaded = rcu_rdp_is_offloaded(rdp);
> > >  	if (rdp->cpu_no_qs.b.norm || rdp->gp_seq != rnp->gp_seq ||
> > >  	    rdp->gpwrap) {
> > 
> > BTW Paul, if we happen to switch to non-NOCB (deoffload) some time after
> > rcu_report_qs_rdp(), it's possible that the rcu_accelerate_cbs()
> > that was supposed to be handled by nocb kthreads on behalf of
> > rcu_core() -> rcu_report_qs_rdp() would not happen. At least not until
> > we invoke rcu_core() again. Not sure how much harm that could cause.
> 
> Again, can we just invoke rcu_core() as soon as this is noticed?

Right. So I'm going to do things a bit differently. I'm going to add
a new segcblist state flag so that during the deoffloading process,
the very first step is an invoke_rcu_core() on the target after setting a
flag that requires handling all these things: accelerate/do_batch, etc...

Then there will remain the "do we still have pending callbacks after do_batch?"
question, in which case we'll need to invoke rcu_core() again as long as we are
in the middle of deoffloading.

Ok, now to write the patches.
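
A rough sketch of that scheme; the flag and helper names below are
placeholders (the helpers assume the existing rcu_segcblist flag accessors),
not the actual patches:

/* Placeholder flag; the real series may name and place this differently. */
#define SEGCBLIST_DEOFFLOAD_PENDING	BIT(5)

static long rcu_nocb_rdp_deoffload(void *arg)
{
	struct rcu_data *rdp = arg;

	/*
	 * Very first step, assuming this runs on the target CPU as the
	 * de-offload work does: make rcu_core() responsible again for
	 * accelerate/do_batch on this CPU, then kick it.
	 */
	rcu_segcblist_set_flags(&rdp->cblist, SEGCBLIST_DEOFFLOAD_PENDING);
	invoke_rcu_core();

	/* ... rest of the de-offloading handshake ... */
	return 0;
}

/*
 * Placeholder hook at the end of rcu_core()/rcu_do_batch(): as long as we
 * are mid-deoffload and callbacks remain, keep re-invoking rcu_core() so
 * none are missed.
 */
static void rcu_core_deoffload_tail(struct rcu_data *rdp)
{
	if (rcu_segcblist_test_flags(&rdp->cblist, SEGCBLIST_DEOFFLOAD_PENDING) &&
	    rcu_segcblist_ready_cbs(&rdp->cblist))
		invoke_rcu_core();
}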
Frederic Weisbecker Sept. 22, 2021, 11:38 a.m. UTC | #13
On Wed, Sep 22, 2021 at 01:27:31PM +0200, Sebastian Andrzej Siewior wrote:
> On 2021-09-22 13:10:12 [+0200], Frederic Weisbecker wrote:
> > On Wed, Sep 22, 2021 at 08:32:08AM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2021-09-22 01:45:18 [+0200], Frederic Weisbecker wrote:
> > > > 
> > > > Also while at it, I'm asking again: traditionally softirqs could assume that
> > > > manipulating a local state was safe against !irq_count() code fiddling with
> > > > the same state on the same CPU.
> > > > 
> > > > Now with preemptible softirqs, that assumption can be broken anytime. RCU was
> > > > fortunate enough to have a warning for that. But who knows how many issues like
> > > > this are lurking?
> > > 
> > > If "local state" is modified then it is safe as long as it is modified
> > > within a local_bh_disable() section. And we are in this section while
> > > invoking a forced-threaded interrupt. The special part about RCU is
> > > that it is used in_irq() as part of core-code.
> > 
> > But local_bh_disable() was meant to protect against interrupting softirqs,
> > not the other way around (softirqs being preempted by other tasks). The latter
> > semantic is new, and nobody had it in mind until softirqs were made
> > preemptible.
> > 
> > For example:
> > 
> >                              CPU 0
> >           -----------------------------------------------
> >           SOFTIRQ                            RANDOM TASK
> >           ------                             -----------
> >           int *X = &per_cpu(CPUX, 0)         int *X = &per_cpu(CPUX, 0)
> >           int A, B;                          WRITE_ONCE(*X, 0);
> >                                              WRITE_ONCE(*X, 1);
> >           A = READ_ONCE(*X);
> >           B = READ_ONCE(*X);
> > 
> > 
> > We used to have the guarantee that A == B. That's not true anymore. Now
> > some new explicit local_bh_disable() should be carefully placed on RANDOM_TASK
> > where it wasn't necessary before. RCU is not that special in this regard.
> 
> The part with rcutree.use_softirq=0 on RT does not make it any better,
> right?

The rcuc kthread disables softirqs before calling rcu_core(), so it behaves
pretty much the same as a softirq. Or am I missing something?

> So you rely on some implicit behaviour which breaks with RT such as:
> 
>                               CPU 0
>            -----------------------------------------------
>            RANDOM TASK-A                      RANDOM TASK-B
>            ------                             -----------
>            int *X = &per_cpu(CPUX, 0)         int *X = &per_cpu(CPUX, 0)
>            int A, B;                          
> 					      spin_lock(&D);
>            spin_lock(&C);
> 	   				      WRITE_ONCE(*X, 0);
>            A = READ_ONCE(*X);
>                                               WRITE_ONCE(*X, 1);
>            B = READ_ONCE(*X);
> 
> while spinlock C and D are just random locks not related to CPUX but it
> just happens that they are held at that time. So for !RT you guarantee
> that A == B while it is not the case on RT.

Not sure which spinlocks you are referring to here. Also most RCU spinlocks
are raw.

> 
> Sebastian
Sebastian Andrzej Siewior Sept. 22, 2021, 1:02 p.m. UTC | #14
On 2021-09-22 13:38:20 [+0200], Frederic Weisbecker wrote:
> > The part with rcutree.use_softirq=0 on RT does not make it any better,
> > right?
> 
> The rcuc kthread disables softirqs before calling rcu_core(), so it behaves
> pretty much the same as a softirq. Or am I missing something?

Oh, no you don't.

> > So you rely on some implicit behaviour which breaks with RT such as:
> > 
> >                               CPU 0
> >            -----------------------------------------------
> >            RANDOM TASK-A                      RANDOM TASK-B
> >            ------                             -----------
> >            int *X = &per_cpu(CPUX, 0)         int *X = &per_cpu(CPUX, 0)
> >            int A, B;                          
> > 					      spin_lock(&D);
> >            spin_lock(&C);
> > 	   				      WRITE_ONCE(*X, 0);
> >            A = READ_ONCE(*X);
> >                                               WRITE_ONCE(*X, 1);
> >            B = READ_ONCE(*X);
> > 
> > while spinlock C and D are just random locks not related to CPUX but it
> > just happens that they are held at that time. So for !RT you guarantee
> > that A == B while it is not the case on RT.
> 
> Not sure which spinlocks you are referring to here. Also most RCU spinlocks
> are raw.

I was bringing an example where you also could rely on implicit locking
provided by spin_lock() which breaks on RT.

Sebastian
Frederic Weisbecker Sept. 23, 2021, 10:02 a.m. UTC | #15
On Wed, Sep 22, 2021 at 03:02:32PM +0200, Sebastian Andrzej Siewior wrote:
> On 2021-09-22 13:38:20 [+0200], Frederic Weisbecker wrote:
> > > So you rely on some implicit behaviour which breaks with RT such as:
> > > 
> > >                               CPU 0
> > >            -----------------------------------------------
> > >            RANDOM TASK-A                      RANDOM TASK-B
> > >            ------                             -----------
> > >            int *X = &per_cpu(CPUX, 0)         int *X = &per_cpu(CPUX, 0)
> > >            int A, B;                          
> > > 					      spin_lock(&D);
> > >            spin_lock(&C);
> > > 	   				      WRITE_ONCE(*X, 0);
> > >            A = READ_ONCE(*X);
> > >                                               WRITE_ONCE(*X, 1);
> > >            B = READ_ONCE(*X);
> > > 
> > > while spinlock C and D are just random locks not related to CPUX but it
> > > just happens that they are held at that time. So for !RT you guarantee
> > > that A == B while it is not the case on RT.
> > 
> > Not sure which spinlocks you are referring to here. Also most RCU spinlocks
> > are raw.
> 
> I was bringing an example where you also could rely on implicit locking
> provided by spin_lock() which breaks on RT.

Good point!
Valentin Schneider Sept. 30, 2021, 9 a.m. UTC | #16
Hi,

On 21/09/21 23:12, Thomas Gleixner wrote:
> Valentin reported warnings about suspicious RCU usage on RT kernels. Those
> happen when offloading of RCU callbacks is enabled:
>
>   WARNING: suspicious RCU usage
>   5.13.0-rt1 #20 Not tainted
>   -----------------------------
>   kernel/rcu/tree_plugin.h:69 Unsafe read of RCU_NOCB offloaded state!
>
>   rcu_rdp_is_offloaded (kernel/rcu/tree_plugin.h:69 kernel/rcu/tree_plugin.h:58)
>   rcu_core (kernel/rcu/tree.c:2332 kernel/rcu/tree.c:2398 kernel/rcu/tree.c:2777)
>   rcu_cpu_kthread (./include/linux/bottom_half.h:32 kernel/rcu/tree.c:2876)
>
> The reason is that rcu_rdp_is_offloaded() is invoked without one of the
> required protections on RT enabled kernels because local_bh_disable() does
> not disable preemption on RT.
>
> Valentin proposed to add a local lock to the code in question, but that's
> suboptimal in several aspects:
>
>   1) local locks add extra code to !RT kernels for no value.
>
>   2) All possible callsites have to be audited and amended when affected,
>      possibly at an outer function level due to lock nesting issues.
>
>   3) As the local lock has to be taken at the outer functions, it's required
>      to release and reacquire it in the inner code sections which might
>      voluntarily schedule, e.g. rcu_do_batch().
>
> Both callsites of rcu_rdp_is_offloaded() which trigger this check invoke
> rcu_rdp_is_offloaded() in the variable declaration section right at the top
> of the functions. But the actual usage of the result is either within a
> section which provides the required protections or after such a section.
>
> So the obvious solution is to move the invocation into the code sections
> which provide the proper protections, which solves the problem for RT and
> does not have any impact on !RT kernels.
>

Thanks for taking a look at this!

My reasoning for adding protection in the outer functions was to prevent
impaired unlocks of rcu_nocb_{un}lock_irqsave(), as that too depends on the
offload state. Cf. Frederic's writeup:

  http://lore.kernel.org/r/20210727230814.GC283787@lothringen

Anywho, I see Frederic has sent a fancy new series, let me go stare at it.
Frederic Weisbecker Sept. 30, 2021, 10:53 a.m. UTC | #17
On Thu, Sep 30, 2021 at 10:00:39AM +0100, Valentin Schneider wrote:
> Hi,
> 
> On 21/09/21 23:12, Thomas Gleixner wrote:
> > Valentin reported warnings about suspicious RCU usage on RT kernels. Those
> > happen when offloading of RCU callbacks is enabled:
> >
> >   WARNING: suspicious RCU usage
> >   5.13.0-rt1 #20 Not tainted
> >   -----------------------------
> >   kernel/rcu/tree_plugin.h:69 Unsafe read of RCU_NOCB offloaded state!
> >
> >   rcu_rdp_is_offloaded (kernel/rcu/tree_plugin.h:69 kernel/rcu/tree_plugin.h:58)
> >   rcu_core (kernel/rcu/tree.c:2332 kernel/rcu/tree.c:2398 kernel/rcu/tree.c:2777)
> >   rcu_cpu_kthread (./include/linux/bottom_half.h:32 kernel/rcu/tree.c:2876)
> >
> > The reason is that rcu_rdp_is_offloaded() is invoked without one of the
> > required protections on RT enabled kernels because local_bh_disable() does
> > not disable preemption on RT.
> >
> > Valentin proposed to add a local lock to the code in question, but that's
> > suboptimal in several aspects:
> >
> >   1) local locks add extra code to !RT kernels for no value.
> >
> >   2) All possible callsites have to be audited and amended when affected,
> >      possibly at an outer function level due to lock nesting issues.
> >
> >   3) As the local lock has to be taken at the outer functions, it's required
> >      to release and reacquire it in the inner code sections which might
> >      voluntarily schedule, e.g. rcu_do_batch().
> >
> > Both callsites of rcu_rdp_is_offloaded() which trigger this check invoke
> > rcu_rdp_is_offloaded() in the variable declaration section right at the top
> > of the functions. But the actual usage of the result is either within a
> > section which provides the required protections or after such a section.
> >
> > So the obvious solution is to move the invocation into the code sections
> > which provide the proper protections, which solves the problem for RT and
> > does not have any impact on !RT kernels.
> >
> 
> Thanks for taking a look at this!
> 
> My reasoning for adding protection in the outer functions was to prevent
> impaired unlocks of rcu_nocb_{un}lock_irqsave(), as that too depends on the
> offload state. Cf. Frederic's writeup:
> 
>   http://lore.kernel.org/r/20210727230814.GC283787@lothringen

I was wrong about that BTW!
Because rcu_nocb_lock() always requires IRQs to be disabled, which of course disables
preemption, the offloaded state can't change between
rcu_nocb_lock[_irqsave]() and rcu_nocb_unlock[_irqrestore](). But anyway there
were many other issues to fix :-)


> 
> Anywho, I see Frederic has sent a fancy new series, let me go stare at it.
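
To spell out the point about rcu_nocb_lock() above, a small illustrative
sketch (the function name is made up):

static void example_cblist_update(struct rcu_data *rdp)
{
	unsigned long flags;

	/*
	 * IRQs are disabled from here on, which also disables preemption,
	 * so the local rdp's offloaded state cannot change until the
	 * matching unlock: lock and unlock are guaranteed to take the
	 * same branch (nocb_lock vs. plain IRQ disabling).
	 */
	rcu_nocb_lock_irqsave(rdp, flags);
	/* ... manipulate rdp->cblist ... */
	rcu_nocb_unlock_irqrestore(rdp, flags);
}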
Valentin Schneider Sept. 30, 2021, 1:22 p.m. UTC | #18
On 30/09/21 12:53, Frederic Weisbecker wrote:
> On Thu, Sep 30, 2021 at 10:00:39AM +0100, Valentin Schneider wrote:
>> My reasoning for adding protection in the outer functions was to prevent
>> impaired unlocks of rcu_nocb_{un}lock_irqsave(), as that too depends on the
>> offload state. Cf. Frederic's writeup:
>>
>>   http://lore.kernel.org/r/20210727230814.GC283787@lothringen
>
> I was wrong about that BTW!
> Because rcu_nocb_lock() always requires IRQs to be disabled, which of course disables
> preemption, the offloaded state can't change between
> rcu_nocb_lock[_irqsave]() and rcu_nocb_unlock[_irqrestore](). But anyway there
> were many other issues to fix :-)
>

Ooooh... Even with you pointing it out, it took me a while to see it that
way. It's tough to get out of holidays mode :-)
diff mbox series

Patch

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 51f24ecd94b2..caadfec994f3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -87,6 +87,7 @@  static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, rcu_data) = {
 	.dynticks = ATOMIC_INIT(RCU_DYNTICK_CTRL_CTR),
 #ifdef CONFIG_RCU_NOCB_CPU
 	.cblist.flags = SEGCBLIST_SOFTIRQ_ONLY,
+	.nocb_local_lock =  INIT_LOCAL_LOCK(nocb_local_lock),
 #endif
 };
 static struct rcu_state rcu_state = {
@@ -2853,10 +2854,12 @@  static void rcu_cpu_kthread(unsigned int cpu)
 {
 	unsigned int *statusp = this_cpu_ptr(&rcu_data.rcu_cpu_kthread_status);
 	char work, *workp = this_cpu_ptr(&rcu_data.rcu_cpu_has_work);
+	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 	int spincnt;
 
 	trace_rcu_utilization(TPS("Start CPU kthread@rcu_run"));
 	for (spincnt = 0; spincnt < 10; spincnt++) {
+		rcu_nocb_local_lock(rdp);
 		local_bh_disable();
 		*statusp = RCU_KTHREAD_RUNNING;
 		local_irq_disable();
@@ -2866,6 +2869,7 @@  static void rcu_cpu_kthread(unsigned int cpu)
 		if (work)
 			rcu_core();
 		local_bh_enable();
+		rcu_nocb_local_unlock(rdp);
 		if (*workp == 0) {
 			trace_rcu_utilization(TPS("End CPU kthread@rcu_wait"));
 			*statusp = RCU_KTHREAD_WAITING;
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 305cf6aeb408..aa6831255fec 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -210,6 +210,8 @@  struct rcu_data {
 	struct timer_list nocb_timer;	/* Enforce finite deferral. */
 	unsigned long nocb_gp_adv_time;	/* Last call_rcu() CB adv (jiffies). */
 
+	local_lock_t nocb_local_lock;
+
 	/* The following fields are used by call_rcu, hence own cacheline. */
 	raw_spinlock_t nocb_bypass_lock ____cacheline_internodealigned_in_smp;
 	struct rcu_cblist nocb_bypass;	/* Lock-contention-bypass CB list. */
@@ -445,6 +447,8 @@  static void rcu_nocb_unlock(struct rcu_data *rdp);
 static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
 				       unsigned long flags);
 static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp);
+static void rcu_nocb_local_lock(struct rcu_data *rdp);
+static void rcu_nocb_local_unlock(struct rcu_data *rdp);
 #ifdef CONFIG_RCU_NOCB_CPU
 static void __init rcu_organize_nocb_kthreads(void);
 #define rcu_nocb_lock_irqsave(rdp, flags)				\
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0ff5e4fb933e..11c4ff00afde 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -21,6 +21,17 @@  static inline int rcu_lockdep_is_held_nocb(struct rcu_data *rdp)
 	return lockdep_is_held(&rdp->nocb_lock);
 }
 
+static inline int rcu_lockdep_is_held_nocb_local(struct rcu_data *rdp)
+{
+	return lockdep_is_held(
+#ifdef CONFIG_PREEMPT_RT
+		&rdp->nocb_local_lock.lock
+#else
+		&rdp->nocb_local_lock
+#endif
+	);
+}
+
 static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp)
 {
 	/* Race on early boot between thread creation and assignment */
@@ -38,7 +49,10 @@  static inline int rcu_lockdep_is_held_nocb(struct rcu_data *rdp)
 {
 	return 0;
 }
-
+static inline int rcu_lockdep_is_held_nocb_local(struct rcu_data *rdp)
+{
+	return 0;
+}
 static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp)
 {
 	return false;
@@ -46,23 +60,44 @@  static inline bool rcu_current_is_nocb_kthread(struct rcu_data *rdp)
 
 #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
 
+/*
+ * Is a local read of the rdp's offloaded state safe and stable?
+ * See rcu_nocb_local_lock() & family.
+ */
+static inline bool rcu_local_offload_access_safe(struct rcu_data *rdp)
+{
+	if (!preemptible())
+		return true;
+
+	if (!migratable()) {
+		if (!IS_ENABLED(CONFIG_RCU_NOCB))
+			return true;
+
+		return rcu_lockdep_is_held_nocb_local(rdp);
+	}
+
+	return false;
+}
+
 static bool rcu_rdp_is_offloaded(struct rcu_data *rdp)
 {
 	/*
-	 * In order to read the offloaded state of an rdp is a safe
-	 * and stable way and prevent from its value to be changed
-	 * under us, we must either hold the barrier mutex, the cpu
-	 * hotplug lock (read or write) or the nocb lock. Local
-	 * non-preemptible reads are also safe. NOCB kthreads and
-	 * timers have their own means of synchronization against the
-	 * offloaded state updaters.
+	 * In order to read the offloaded state of an rdp in a safe and stable
+	 * way and prevent its value from being changed under us, we must either...
 	 */
 	RCU_LOCKDEP_WARN(
+		// ...hold the barrier mutex...
 		!(lockdep_is_held(&rcu_state.barrier_mutex) ||
+		  // ... the cpu hotplug lock (read or write)...
 		  (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_held()) ||
+		  // ... or the NOCB lock.
 		  rcu_lockdep_is_held_nocb(rdp) ||
+		  // Local reads still require the local state to remain stable
+		  // (preemption disabled / local lock held)
 		  (rdp == this_cpu_ptr(&rcu_data) &&
-		   !(IS_ENABLED(CONFIG_PREEMPT_COUNT) && preemptible())) ||
+		   rcu_local_offload_access_safe(rdp)) ||
+		  // NOCB kthreads and timers have their own means of synchronization
+		  // against the offloaded state updaters.
 		  rcu_current_is_nocb_kthread(rdp)),
 		"Unsafe read of RCU_NOCB offloaded state"
 	);
@@ -1629,6 +1664,22 @@  static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
 	}
 }
 
+/*
+ * The invocation of rcu_core() within the RCU core kthreads remains preemptible
+ * under PREEMPT_RT, thus the offload state of a CPU could change while
+ * said kthreads are preempted. Prevent this from happening by protecting the
+ * offload state with a local_lock().
+ */
+static void rcu_nocb_local_lock(struct rcu_data *rdp)
+{
+	local_lock(&rcu_data.nocb_local_lock);
+}
+
+static void rcu_nocb_local_unlock(struct rcu_data *rdp)
+{
+	local_unlock(&rcu_data.nocb_local_lock);
+}
+
 /* Lockdep check that ->cblist may be safely accessed. */
 static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp)
 {
@@ -2396,6 +2447,7 @@  static int rdp_offload_toggle(struct rcu_data *rdp,
 	if (rdp->nocb_cb_sleep)
 		rdp->nocb_cb_sleep = false;
 	rcu_nocb_unlock_irqrestore(rdp, flags);
+	rcu_nocb_local_unlock(rdp);
 
 	/*
 	 * Ignore former value of nocb_cb_sleep and force wake up as it could
@@ -2427,6 +2479,7 @@  static long rcu_nocb_rdp_deoffload(void *arg)
 
 	pr_info("De-offloading %d\n", rdp->cpu);
 
+	rcu_nocb_local_lock(rdp);
 	rcu_nocb_lock_irqsave(rdp, flags);
 	/*
 	 * Flush once and for all now. This suffices because we are
@@ -2509,6 +2562,7 @@  static long rcu_nocb_rdp_offload(void *arg)
 	 * Can't use rcu_nocb_lock_irqsave() while we are in
 	 * SEGCBLIST_SOFTIRQ_ONLY mode.
 	 */
+	rcu_nocb_local_lock(rdp);
 	raw_spin_lock_irqsave(&rdp->nocb_lock, flags);
 
 	/*
@@ -2868,6 +2922,16 @@  static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp,
 	local_irq_restore(flags);
 }
 
+/* No ->nocb_local_lock to acquire. */
+static void rcu_nocb_local_lock(struct rcu_data *rdp)
+{
+}
+
+/* No ->nocb_local_lock to release. */
+static void rcu_nocb_local_unlock(struct rcu_data *rdp)
+{
+}
+
 /* Lockdep check that ->cblist may be safely accessed. */
 static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp)
 {