Message ID | 20221220112520.3596920-1-qiang1.zhang@intel.com
---|---
State | Accepted
Commit | acfe689f2e473fb59b6d2c95af5fe36198bb9a84
Series | rcu: Fix race in set and clear TICK_DEP_BIT_RCU_EXP bitmask
On Tue, Dec 20, 2022 at 07:25:20PM +0800, Zqiang wrote:
> For the kernel built with CONFIG_NO_HZ_FULL enabled and the following
> CPUs are nohz_full CPUs:
>
> CPU1                                                  CPU2
> rcu_report_exp_cpu_mult                               synchronize_rcu_expedited_wait
>   acquires rnp->lock                                    mask = rnp->expmask;
>                                                         for_each_leaf_node_cpu_mask(rnp, cpu, mask)
>   rnp->expmask = rnp->expmask & ~mask;                    rdp = per_cpu_ptr(&rcu_data, cpu1);
>   for_each_leaf_node_cpu_mask(rnp, cpu, mask)
>     rdp = per_cpu_ptr(&rcu_data, cpu1);
>     if (!rdp->rcu_forced_tick_exp)
>       continue;                                           rdp->rcu_forced_tick_exp = true;
>                                                           tick_dep_set_cpu(cpu1, TICK_DEP_BIT_RCU_EXP);
>
> In the above scenario, after CPU1 reported the quiescent state, CPU1
> misses the opportunity to clear the TICK_DEP_BIT_RCU_EXP bitmask. It
> will not be cleared until the next expedited grace period starts and
> the CPU1 quiescent state is reported again. During this window, CPU1's
> tick cannot be stopped. If CPU1 has only one runnable task and this
> task has aggressive real-time response constraints, this task may have
> one of the worst response times.
>
> Therefore, this commit acquires rnp->lock when setting the
> TICK_DEP_BIT_RCU_EXP bitmask to fix this race.
>
> Signed-off-by: Zqiang <qiang1.zhang@intel.com>

Good eyes, thank you!!!

Queued for testing and further review as follows, as always, please
check for errors.

                                                        Thanx, Paul

------------------------------------------------------------------------

commit acfe689f2e473fb59b6d2c95af5fe36198bb9a84
Author: Zqiang <qiang1.zhang@intel.com>
Date:   Tue Dec 20 19:25:20 2022 +0800

rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race

For kernels built with CONFIG_NO_HZ_FULL=y, the following scenario can
result in the scheduling-clock interrupt remaining enabled on a holdout
CPU after its quiescent state has been reported:

CPU1                                                  CPU2
rcu_report_exp_cpu_mult                               synchronize_rcu_expedited_wait
  acquires rnp->lock                                    mask = rnp->expmask;
                                                        for_each_leaf_node_cpu_mask(rnp, cpu, mask)
  rnp->expmask = rnp->expmask & ~mask;                    rdp = per_cpu_ptr(&rcu_data, cpu1);
  for_each_leaf_node_cpu_mask(rnp, cpu, mask)
    rdp = per_cpu_ptr(&rcu_data, cpu1);
    if (!rdp->rcu_forced_tick_exp)
      continue;                                           rdp->rcu_forced_tick_exp = true;
                                                          tick_dep_set_cpu(cpu1, TICK_DEP_BIT_RCU_EXP);

The problem is that CPU2's sampling of rnp->expmask is obsolete by the
time it invokes tick_dep_set_cpu(), and CPU1 is not guaranteed to see
CPU2's store to ->rcu_forced_tick_exp in time to clear it.  And even if
CPU1 does see that store, it might invoke tick_dep_clear_cpu() before
CPU2 got around to executing its tick_dep_set_cpu(), which would still
leave the victim CPU with its scheduler-clock tick running.

Either way, an nohz_full real-time application running on the victim
CPU would have its latency needlessly degraded.

Note that expedited RCU grace periods look at context-tracking
information, and so if the CPU is executing in nohz_full usermode
throughout, that CPU cannot be victimized in this manner.

This commit therefore causes synchronize_rcu_expedited_wait to hold
the rcu_node structure's ->lock when checking for holdout CPUs, setting
TICK_DEP_BIT_RCU_EXP, and invoking tick_dep_set_cpu(), thus preventing
this race.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 249c2967d9e6c..7cc4856da0817 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -594,6 +594,7 @@ static void synchronize_rcu_expedited_wait(void)
         struct rcu_data *rdp;
         struct rcu_node *rnp;
         struct rcu_node *rnp_root = rcu_get_root();
+        unsigned long flags;
 
         trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("startwait"));
         jiffies_stall = rcu_exp_jiffies_till_stall_check();
@@ -602,17 +603,17 @@ static void synchronize_rcu_expedited_wait(void)
                 if (synchronize_rcu_expedited_wait_once(1))
                         return;
                 rcu_for_each_leaf_node(rnp) {
+                        raw_spin_lock_irqsave_rcu_node(rnp, flags);
                         mask = READ_ONCE(rnp->expmask);
                         for_each_leaf_node_cpu_mask(rnp, cpu, mask) {
                                 rdp = per_cpu_ptr(&rcu_data, cpu);
                                 if (rdp->rcu_forced_tick_exp)
                                         continue;
                                 rdp->rcu_forced_tick_exp = true;
-                                preempt_disable();
                                 if (cpu_online(cpu))
                                         tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP);
-                                preempt_enable();
                         }
+                        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                 }
                 j = READ_ONCE(jiffies_till_first_fqs);
                 if (synchronize_rcu_expedited_wait_once(j + HZ))
> For the kernel built with CONFIG_NO_HZ_FULL enabled and the following
> CPUs are nohz_full CPUs:
>
> CPU1                                                  CPU2
> rcu_report_exp_cpu_mult                               synchronize_rcu_expedited_wait
>   acquires rnp->lock                                    mask = rnp->expmask;
>                                                         for_each_leaf_node_cpu_mask(rnp, cpu, mask)
>   rnp->expmask = rnp->expmask & ~mask;                    rdp = per_cpu_ptr(&rcu_data, cpu1);
>   for_each_leaf_node_cpu_mask(rnp, cpu, mask)
>     rdp = per_cpu_ptr(&rcu_data, cpu1);
>     if (!rdp->rcu_forced_tick_exp)
>       continue;                                           rdp->rcu_forced_tick_exp = true;
>                                                           tick_dep_set_cpu(cpu1, TICK_DEP_BIT_RCU_EXP);
>
> In the above scenario, after CPU1 reported the quiescent state, CPU1
> misses the opportunity to clear the TICK_DEP_BIT_RCU_EXP bitmask. It
> will not be cleared until the next expedited grace period starts and
> the CPU1 quiescent state is reported again. During this window, CPU1's
> tick cannot be stopped. If CPU1 has only one runnable task and this
> task has aggressive real-time response constraints, this task may have
> one of the worst response times.
>
> Therefore, this commit acquires rnp->lock when setting the
> TICK_DEP_BIT_RCU_EXP bitmask to fix this race.
>
> Signed-off-by: Zqiang <qiang1.zhang@intel.com>
>
> Good eyes, thank you!!!
>
> Queued for testing and further review as follows, as always, please
> check for errors.
>

It looks more clear now, thank you!

Thanks
Zqiang

>                                                       Thanx, Paul
>
> ------------------------------------------------------------------------

commit acfe689f2e473fb59b6d2c95af5fe36198bb9a84
Author: Zqiang <qiang1.zhang@intel.com>
Date:   Tue Dec 20 19:25:20 2022 +0800

rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race

For kernels built with CONFIG_NO_HZ_FULL=y, the following scenario can
result in the scheduling-clock interrupt remaining enabled on a holdout
CPU after its quiescent state has been reported:

CPU1                                                  CPU2
rcu_report_exp_cpu_mult                               synchronize_rcu_expedited_wait
  acquires rnp->lock                                    mask = rnp->expmask;
                                                        for_each_leaf_node_cpu_mask(rnp, cpu, mask)
  rnp->expmask = rnp->expmask & ~mask;                    rdp = per_cpu_ptr(&rcu_data, cpu1);
  for_each_leaf_node_cpu_mask(rnp, cpu, mask)
    rdp = per_cpu_ptr(&rcu_data, cpu1);
    if (!rdp->rcu_forced_tick_exp)
      continue;                                           rdp->rcu_forced_tick_exp = true;
                                                          tick_dep_set_cpu(cpu1, TICK_DEP_BIT_RCU_EXP);

The problem is that CPU2's sampling of rnp->expmask is obsolete by the
time it invokes tick_dep_set_cpu(), and CPU1 is not guaranteed to see
CPU2's store to ->rcu_forced_tick_exp in time to clear it.  And even if
CPU1 does see that store, it might invoke tick_dep_clear_cpu() before
CPU2 got around to executing its tick_dep_set_cpu(), which would still
leave the victim CPU with its scheduler-clock tick running.

Either way, an nohz_full real-time application running on the victim
CPU would have its latency needlessly degraded.

Note that expedited RCU grace periods look at context-tracking
information, and so if the CPU is executing in nohz_full usermode
throughout, that CPU cannot be victimized in this manner.

This commit therefore causes synchronize_rcu_expedited_wait to hold
the rcu_node structure's ->lock when checking for holdout CPUs, setting
TICK_DEP_BIT_RCU_EXP, and invoking tick_dep_set_cpu(), thus preventing
this race.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 249c2967d9e6c..7cc4856da0817 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -594,6 +594,7 @@ static void synchronize_rcu_expedited_wait(void)
         struct rcu_data *rdp;
         struct rcu_node *rnp;
         struct rcu_node *rnp_root = rcu_get_root();
+        unsigned long flags;
 
         trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("startwait"));
         jiffies_stall = rcu_exp_jiffies_till_stall_check();
@@ -602,17 +603,17 @@ static void synchronize_rcu_expedited_wait(void)
                 if (synchronize_rcu_expedited_wait_once(1))
                         return;
                 rcu_for_each_leaf_node(rnp) {
+                        raw_spin_lock_irqsave_rcu_node(rnp, flags);
                         mask = READ_ONCE(rnp->expmask);
                         for_each_leaf_node_cpu_mask(rnp, cpu, mask) {
                                 rdp = per_cpu_ptr(&rcu_data, cpu);
                                 if (rdp->rcu_forced_tick_exp)
                                         continue;
                                 rdp->rcu_forced_tick_exp = true;
-                                preempt_disable();
                                 if (cpu_online(cpu))
                                         tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP);
-                                preempt_enable();
                         }
+                        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                 }
                 j = READ_ONCE(jiffies_till_first_fqs);
                 if (synchronize_rcu_expedited_wait_once(j + HZ))
On Thu, Dec 22, 2022 at 09:48:14AM +0000, Zhang, Qiang1 wrote:
> > For the kernel built with CONFIG_NO_HZ_FULL enabled and the following
> > CPUs are nohz_full CPUs:
> >
> > CPU1                                                  CPU2
> > rcu_report_exp_cpu_mult                               synchronize_rcu_expedited_wait
> >   acquires rnp->lock                                    mask = rnp->expmask;
> >                                                         for_each_leaf_node_cpu_mask(rnp, cpu, mask)
> >   rnp->expmask = rnp->expmask & ~mask;                    rdp = per_cpu_ptr(&rcu_data, cpu1);
> >   for_each_leaf_node_cpu_mask(rnp, cpu, mask)
> >     rdp = per_cpu_ptr(&rcu_data, cpu1);
> >     if (!rdp->rcu_forced_tick_exp)
> >       continue;                                           rdp->rcu_forced_tick_exp = true;
> >                                                           tick_dep_set_cpu(cpu1, TICK_DEP_BIT_RCU_EXP);
> >
> > In the above scenario, after CPU1 reported the quiescent state, CPU1
> > misses the opportunity to clear the TICK_DEP_BIT_RCU_EXP bitmask. It
> > will not be cleared until the next expedited grace period starts and
> > the CPU1 quiescent state is reported again. During this window, CPU1's
> > tick cannot be stopped. If CPU1 has only one runnable task and this
> > task has aggressive real-time response constraints, this task may have
> > one of the worst response times.
> >
> > Therefore, this commit acquires rnp->lock when setting the
> > TICK_DEP_BIT_RCU_EXP bitmask to fix this race.
> >
> > Signed-off-by: Zqiang <qiang1.zhang@intel.com>
> >
> > Good eyes, thank you!!!
> >
> > Queued for testing and further review as follows, as always, please
> > check for errors.
> >
>
> It looks more clear now, thank you!

Thank you for checking them both!

                                                        Thanx, Paul

> Thanks
> Zqiang
>
> >                                                     Thanx, Paul
> >
> > ------------------------------------------------------------------------
>
> commit acfe689f2e473fb59b6d2c95af5fe36198bb9a84
> Author: Zqiang <qiang1.zhang@intel.com>
> Date:   Tue Dec 20 19:25:20 2022 +0800
>
> rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
>
> For kernels built with CONFIG_NO_HZ_FULL=y, the following scenario can
> result in the scheduling-clock interrupt remaining enabled on a holdout
> CPU after its quiescent state has been reported:
>
> CPU1                                                  CPU2
> rcu_report_exp_cpu_mult                               synchronize_rcu_expedited_wait
>   acquires rnp->lock                                    mask = rnp->expmask;
>                                                         for_each_leaf_node_cpu_mask(rnp, cpu, mask)
>   rnp->expmask = rnp->expmask & ~mask;                    rdp = per_cpu_ptr(&rcu_data, cpu1);
>   for_each_leaf_node_cpu_mask(rnp, cpu, mask)
>     rdp = per_cpu_ptr(&rcu_data, cpu1);
>     if (!rdp->rcu_forced_tick_exp)
>       continue;                                           rdp->rcu_forced_tick_exp = true;
>                                                           tick_dep_set_cpu(cpu1, TICK_DEP_BIT_RCU_EXP);
>
> The problem is that CPU2's sampling of rnp->expmask is obsolete by the
> time it invokes tick_dep_set_cpu(), and CPU1 is not guaranteed to see
> CPU2's store to ->rcu_forced_tick_exp in time to clear it.  And even if
> CPU1 does see that store, it might invoke tick_dep_clear_cpu() before
> CPU2 got around to executing its tick_dep_set_cpu(), which would still
> leave the victim CPU with its scheduler-clock tick running.
>
> Either way, an nohz_full real-time application running on the victim
> CPU would have its latency needlessly degraded.
>
> Note that expedited RCU grace periods look at context-tracking
> information, and so if the CPU is executing in nohz_full usermode
> throughout, that CPU cannot be victimized in this manner.
>
> This commit therefore causes synchronize_rcu_expedited_wait to hold
> the rcu_node structure's ->lock when checking for holdout CPUs, setting
> TICK_DEP_BIT_RCU_EXP, and invoking tick_dep_set_cpu(), thus preventing
> this race.
>
> Signed-off-by: Zqiang <qiang1.zhang@intel.com>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 249c2967d9e6c..7cc4856da0817 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -594,6 +594,7 @@ static void synchronize_rcu_expedited_wait(void)
>          struct rcu_data *rdp;
>          struct rcu_node *rnp;
>          struct rcu_node *rnp_root = rcu_get_root();
> +        unsigned long flags;
>  
>          trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("startwait"));
>          jiffies_stall = rcu_exp_jiffies_till_stall_check();
> @@ -602,17 +603,17 @@ static void synchronize_rcu_expedited_wait(void)
>                  if (synchronize_rcu_expedited_wait_once(1))
>                          return;
>                  rcu_for_each_leaf_node(rnp) {
> +                        raw_spin_lock_irqsave_rcu_node(rnp, flags);
>                          mask = READ_ONCE(rnp->expmask);
>                          for_each_leaf_node_cpu_mask(rnp, cpu, mask) {
>                                  rdp = per_cpu_ptr(&rcu_data, cpu);
>                                  if (rdp->rcu_forced_tick_exp)
>                                          continue;
>                                  rdp->rcu_forced_tick_exp = true;
> -                                preempt_disable();
>                                  if (cpu_online(cpu))
>                                          tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP);
> -                                preempt_enable();
>                          }
> +                        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>                  }
>                  j = READ_ONCE(jiffies_till_first_fqs);
>                  if (synchronize_rcu_expedited_wait_once(j + HZ))
On Wed, Dec 21, 2022 at 12:08:49PM -0800, Paul E. McKenney wrote:
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 249c2967d9e6c..7cc4856da0817 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -594,6 +594,7 @@ static void synchronize_rcu_expedited_wait(void)
>          struct rcu_data *rdp;
>          struct rcu_node *rnp;
>          struct rcu_node *rnp_root = rcu_get_root();
> +        unsigned long flags;
>  
>          trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("startwait"));
>          jiffies_stall = rcu_exp_jiffies_till_stall_check();
> @@ -602,17 +603,17 @@ static void synchronize_rcu_expedited_wait(void)
>                  if (synchronize_rcu_expedited_wait_once(1))
>                          return;
>                  rcu_for_each_leaf_node(rnp) {
> +                        raw_spin_lock_irqsave_rcu_node(rnp, flags);
>                          mask = READ_ONCE(rnp->expmask);
>                          for_each_leaf_node_cpu_mask(rnp, cpu, mask) {
>                                  rdp = per_cpu_ptr(&rcu_data, cpu);
>                                  if (rdp->rcu_forced_tick_exp)
>                                          continue;
>                                  rdp->rcu_forced_tick_exp = true;
> -                                preempt_disable();
>                                  if (cpu_online(cpu))
>                                          tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP);
> -                                preempt_enable();
>                          }
> +                        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>                  }
>                  j = READ_ONCE(jiffies_till_first_fqs);
>                  if (synchronize_rcu_expedited_wait_once(j + HZ))

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

BTW why are we forcing the tick on the whole node?

And shouldn't we set the tick dependency from rcu_exp_handler() instead?

Thanks.
On Sat, Dec 31, 2022 at 07:25:08PM +0100, Frederic Weisbecker wrote:
> On Wed, Dec 21, 2022 at 12:08:49PM -0800, Paul E. McKenney wrote:
> > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > index 249c2967d9e6c..7cc4856da0817 100644
> > --- a/kernel/rcu/tree_exp.h
> > +++ b/kernel/rcu/tree_exp.h
> > @@ -594,6 +594,7 @@ static void synchronize_rcu_expedited_wait(void)
> >          struct rcu_data *rdp;
> >          struct rcu_node *rnp;
> >          struct rcu_node *rnp_root = rcu_get_root();
> > +        unsigned long flags;
> >  
> >          trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("startwait"));
> >          jiffies_stall = rcu_exp_jiffies_till_stall_check();
> > @@ -602,17 +603,17 @@ static void synchronize_rcu_expedited_wait(void)
> >                  if (synchronize_rcu_expedited_wait_once(1))
> >                          return;
> >                  rcu_for_each_leaf_node(rnp) {
> > +                        raw_spin_lock_irqsave_rcu_node(rnp, flags);
> >                          mask = READ_ONCE(rnp->expmask);
> >                          for_each_leaf_node_cpu_mask(rnp, cpu, mask) {
> >                                  rdp = per_cpu_ptr(&rcu_data, cpu);
> >                                  if (rdp->rcu_forced_tick_exp)
> >                                          continue;
> >                                  rdp->rcu_forced_tick_exp = true;
> > -                                preempt_disable();
> >                                  if (cpu_online(cpu))
> >                                          tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP);
> > -                                preempt_enable();
> >                          }
> > +                        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> >                  }
> >                  j = READ_ONCE(jiffies_till_first_fqs);
> >                  if (synchronize_rcu_expedited_wait_once(j + HZ))
>
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

Thank you!

> BTW why are we forcing the tick on the whole node?

Now that you mention it, that would be more precise.

> And shouldn't we set the tick dependency from rcu_exp_handler() instead?

Because it never occurred to me to check whether this could be invoked
from an interrupt handler?  ;-)

But that does sound like it might be a better approach.

Zqiang, would you be willing to look into this?

                                                        Thanx, Paul
>
>
> On Sat, Dec 31, 2022 at 07:25:08PM +0100, Frederic Weisbecker wrote:
> On Wed, Dec 21, 2022 at 12:08:49PM -0800, Paul E. McKenney wrote:
> > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > index 249c2967d9e6c..7cc4856da0817 100644
> > --- a/kernel/rcu/tree_exp.h
> > +++ b/kernel/rcu/tree_exp.h
> > @@ -594,6 +594,7 @@ static void synchronize_rcu_expedited_wait(void)
> >          struct rcu_data *rdp;
> >          struct rcu_node *rnp;
> >          struct rcu_node *rnp_root = rcu_get_root();
> > +        unsigned long flags;
> >  
> >          trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("startwait"));
> >          jiffies_stall = rcu_exp_jiffies_till_stall_check();
> > @@ -602,17 +603,17 @@ static void synchronize_rcu_expedited_wait(void)
> >                  if (synchronize_rcu_expedited_wait_once(1))
> >                          return;
> >                  rcu_for_each_leaf_node(rnp) {
> > +                        raw_spin_lock_irqsave_rcu_node(rnp, flags);
> >                          mask = READ_ONCE(rnp->expmask);
> >                          for_each_leaf_node_cpu_mask(rnp, cpu, mask) {
> >                                  rdp = per_cpu_ptr(&rcu_data, cpu);
> >                                  if (rdp->rcu_forced_tick_exp)
> >                                          continue;
> >                                  rdp->rcu_forced_tick_exp = true;
> > -                                preempt_disable();
> >                                  if (cpu_online(cpu))
> >                                          tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP);
> > -                                preempt_enable();
> >                          }
> > +                        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> >                  }
> >                  j = READ_ONCE(jiffies_till_first_fqs);
> >                  if (synchronize_rcu_expedited_wait_once(j + HZ))
>
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
>
> Thank you!
>
> BTW why are we forcing the tick on the whole node?
>
> Now that you mention it, that would be more precise.
>
> And shouldn't we set the tick dependency from rcu_exp_handler() instead?
>
> Because it never occurred to me to check whether this could be invoked
> from an interrupt handler?  ;-)
>
> But that does sound like it might be a better approach.
>
> Zqiang, would you be willing to look into this?

Yes, and I have a question. We force the tick dependency because the
expedited grace period did not end on the first wait, which means the
tick dependency is not set every time. If we set the tick dependency in
rcu_exp_handler() instead, does this mean that every time the expedited
grace period starts, the tick dependency will be set unconditionally?

Thoughts?

Thanks
Zqiang

>
>                                                       Thanx, Paul
On Sun, Jan 01, 2023 at 09:41:57AM +0000, Zhang, Qiang1 wrote:
> >
> > On Sat, Dec 31, 2022 at 07:25:08PM +0100, Frederic Weisbecker wrote:
> >
> Yes, and I have a question. We force the tick dependency because the
> expedited grace period did not end on the first wait, which means the
> tick dependency is not set every time. If we set the tick dependency in
> rcu_exp_handler() instead, does this mean that every time the expedited
> grace period starts, the tick dependency will be set unconditionally?
>
> Thoughts?

Only if rcu_exp_handler() fails to report a quiescent state. Then it
means we must poll on the CPU looking for a future one.

In fact the tick dependency should be set when rdp->cpu_no_qs.b.exp is
set to true and cleared when rdp->cpu_no_qs.b.exp is set to false.

Thanks.

> Thanks
> Zqiang
>
> >
> >                                                     Thanx, Paul
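Concretely, the alternative Frederic describes would look roughly like the
sketch below, written in the style of kernel/rcu/tree_exp.h. This is only an
illustration of the idea, not the actual kernel code and not a tested patch:
it compresses rcu_exp_handler()'s real quiescent-state checks into a single
rcu_is_cpu_rrupt_from_idle() test and ignores the memory-ordering and
CPU-hotplug details a real implementation would have to handle.

/* Clear the dependency at the point where ->cpu_no_qs.b.exp is cleared. */
static void rcu_report_exp_rdp(struct rcu_data *rdp)
{
        WRITE_ONCE(rdp->cpu_no_qs.b.exp, false);
        if (tick_nohz_full_cpu(rdp->cpu))
                tick_dep_clear_cpu(rdp->cpu, TICK_DEP_BIT_RCU_EXP);
        rcu_report_exp_cpu_mult(rdp->mynode, rdp->grpmask, true);
}

/* Set the dependency only when the IPI fails to report a quiescent state. */
static void rcu_exp_handler(void *unused)
{
        struct rcu_data *rdp = this_cpu_ptr(&rcu_data);

        if (rcu_is_cpu_rrupt_from_idle()) {
                /* Immediate quiescent state: report it, leave the tick alone. */
                rcu_report_exp_rdp(rdp);
                return;
        }

        /*
         * No immediate quiescent state, so this CPU is a holdout that the
         * expedited grace period must poll.  Mark it and force its tick on
         * here rather than from synchronize_rcu_expedited_wait().
         */
        WRITE_ONCE(rdp->cpu_no_qs.b.exp, true);
        if (tick_nohz_full_cpu(rdp->cpu))
                tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU_EXP);
}

The appeal of this placement is that the tick dependency is set only for a
CPU that has actually failed to report a quiescent state from the IPI,
rather than for every CPU still present in a leaf rcu_node's ->expmask,
which also addresses the "whole node" question raised above.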
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 927abaf6c822..e5fe0099488b 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -593,6 +593,7 @@ static void synchronize_rcu_expedited_wait(void)
         struct rcu_data *rdp;
         struct rcu_node *rnp;
         struct rcu_node *rnp_root = rcu_get_root();
+        unsigned long flags;
 
         trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("startwait"));
         jiffies_stall = rcu_exp_jiffies_till_stall_check();
@@ -601,17 +602,17 @@ static void synchronize_rcu_expedited_wait(void)
                 if (synchronize_rcu_expedited_wait_once(1))
                         return;
                 rcu_for_each_leaf_node(rnp) {
+                        raw_spin_lock_irqsave_rcu_node(rnp, flags);
                         mask = READ_ONCE(rnp->expmask);
                         for_each_leaf_node_cpu_mask(rnp, cpu, mask) {
                                 rdp = per_cpu_ptr(&rcu_data, cpu);
                                 if (rdp->rcu_forced_tick_exp)
                                         continue;
                                 rdp->rcu_forced_tick_exp = true;
-                                preempt_disable();
                                 if (cpu_online(cpu))
                                         tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP);
-                                preempt_enable();
                         }
+                        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                 }
                 j = READ_ONCE(jiffies_till_first_fqs);
                 if (synchronize_rcu_expedited_wait_once(j + HZ))
For the kernel built with CONFIG_NO_HZ_FULL enabled and the following
CPUs are nohz_full CPUs:

CPU1                                                  CPU2
rcu_report_exp_cpu_mult                               synchronize_rcu_expedited_wait
  acquires rnp->lock                                    mask = rnp->expmask;
                                                        for_each_leaf_node_cpu_mask(rnp, cpu, mask)
  rnp->expmask = rnp->expmask & ~mask;                    rdp = per_cpu_ptr(&rcu_data, cpu1);
  for_each_leaf_node_cpu_mask(rnp, cpu, mask)
    rdp = per_cpu_ptr(&rcu_data, cpu1);
    if (!rdp->rcu_forced_tick_exp)
      continue;                                           rdp->rcu_forced_tick_exp = true;
                                                          tick_dep_set_cpu(cpu1, TICK_DEP_BIT_RCU_EXP);

In the above scenario, after CPU1 reported the quiescent state, CPU1
misses the opportunity to clear the TICK_DEP_BIT_RCU_EXP bitmask. It
will not be cleared until the next expedited grace period starts and
the CPU1 quiescent state is reported again. During this window, CPU1's
tick cannot be stopped. If CPU1 has only one runnable task and this
task has aggressive real-time response constraints, this task may have
one of the worst response times.

Therefore, this commit acquires rnp->lock when setting the
TICK_DEP_BIT_RCU_EXP bitmask to fix this race.

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
---
 kernel/rcu/tree_exp.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)