Message ID | 20221220141625.3612085-1-qiang1.zhang@intel.com |
---|---|
State | Accepted |
Commit | f22caef6cda5ed19a55ec2e703f60f1fa85e52bc |
Series | rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check |
On Tue, Dec 20, 2022 at 10:16:25PM +0800, Zqiang wrote:
> This commit add TICK_DEP_MASK_RCU_EXP dependency check in
> check_tick_dependency(), fix commit df1e849ae4559 ("rcu: Enable
> tick for nohz_full CPUs slow to provide expedited QS").
>
> Signed-off-by: Zqiang <qiang1.zhang@intel.com>

Again, good eyes, thank you!!!

I have queued this for further review and testing, given that it affects
pretty much only RCU.  However, if someone else would rather take it:

Acked-by: Paul E. McKenney <paulmck@kernel.org>

							Thanx, Paul

------------------------------------------------------------------------

commit f22caef6cda5ed19a55ec2e703f60f1fa85e52bc
Author: Zqiang <qiang1.zhang@intel.com>
Date:   Tue Dec 20 22:16:25 2022 +0800

    rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check

    This commit adds checks for the TICK_DEP_MASK_RCU_EXP bit, thus
    enabling RCU expedited grace periods to actually force-enable
    scheduling-clock interrupts on holdout CPUs.

    Fixes: df1e849ae455 ("rcu: Enable tick for nohz_full CPUs slow to provide expedited QS")
    Signed-off-by: Zqiang <qiang1.zhang@intel.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: Frederic Weisbecker <fweisbec@gmail.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index 2e713a7d9aa3a..3e8619c72f774 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -371,7 +371,8 @@ TRACE_EVENT(itimer_expire,
 		tick_dep_name(PERF_EVENTS)		\
 		tick_dep_name(SCHED)			\
 		tick_dep_name(CLOCK_UNSTABLE)		\
-		tick_dep_name_end(RCU)
+		tick_dep_name(RCU)			\
+		tick_dep_name_end(RCU_EXP)
 
 #undef tick_dep_name
 #undef tick_dep_mask_name
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b0e3c9205946f..ba2ac1469d473 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -281,6 +281,11 @@ static bool check_tick_dependency(atomic_t *dep)
 		return true;
 	}
 
+	if (val & TICK_DEP_MASK_RCU_EXP) {
+		trace_tick_stop(0, TICK_DEP_MASK_RCU_EXP);
+		return true;
+	}
+
 	return false;
 }
 
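[Editor's note] For readers following along: check_tick_dependency() is the gatekeeper that decides whether a nohz_full CPU may stop its tick, and before this fix the TICK_DEP_MASK_RCU_EXP bit was set so expedited grace periods could keep the tick running on holdout CPUs, yet it was never tested, so it could never actually do so. Below is a small stand-alone C model of the before/after behavior; it is an illustration only, with invented enum, mask, and function names, and note that it asks "can the tick stop?", the inverse of the question the kernel's check_tick_dependency() answers in the diff above.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative dependency bits, mirroring the style of enum tick_dep_bits. */
enum dep_bits {
	DEP_BIT_POSIX_TIMER,
	DEP_BIT_PERF_EVENTS,
	DEP_BIT_SCHED,
	DEP_BIT_CLOCK_UNSTABLE,
	DEP_BIT_RCU,
	DEP_BIT_RCU_EXP,
};
#define DEP_MASK(bit)	(1u << (bit))

/* Pre-fix logic: the RCU_EXP bit is never consulted. */
static bool tick_can_stop_old(unsigned int dep)
{
	if (dep & (DEP_MASK(DEP_BIT_POSIX_TIMER) | DEP_MASK(DEP_BIT_PERF_EVENTS) |
		   DEP_MASK(DEP_BIT_SCHED) | DEP_MASK(DEP_BIT_CLOCK_UNSTABLE) |
		   DEP_MASK(DEP_BIT_RCU)))
		return false;	/* a recognized dependency keeps the tick running */
	return true;		/* RCU_EXP alone falls through: the tick stops anyway */
}

/* Post-fix logic: RCU_EXP is one more dependency that keeps the tick running. */
static bool tick_can_stop_new(unsigned int dep)
{
	if (dep & DEP_MASK(DEP_BIT_RCU_EXP))
		return false;
	return tick_can_stop_old(dep);
}

int main(void)
{
	/* A holdout CPU whose only dependency is the expedited grace period. */
	unsigned int dep = DEP_MASK(DEP_BIT_RCU_EXP);

	printf("old check: tick %s\n", tick_can_stop_old(dep) ? "stops (the bug)" : "stays on");
	printf("new check: tick %s\n", tick_can_stop_new(dep) ? "stops" : "stays on (the fix)");
	return 0;
}

With only the expedited-RCU dependency set, the old logic lets the tick stop, which is exactly the silent failure the changelog describes; the new logic keeps the tick running until the dependency is cleared.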
On Wed, Dec 21, 2022 at 12:16:34PM -0800, Paul E. McKenney wrote:
> On Tue, Dec 20, 2022 at 10:16:25PM +0800, Zqiang wrote:
> > This commit add TICK_DEP_MASK_RCU_EXP dependency check in
> > check_tick_dependency(), fix commit df1e849ae4559 ("rcu: Enable
> > tick for nohz_full CPUs slow to provide expedited QS").
> >
> > Signed-off-by: Zqiang <qiang1.zhang@intel.com>
>
> Again, good eyes, thank you!!!
>
> I have queued this for further review and testing, given that it affects
> pretty much only RCU.  However, if someone else would rather take it:
>
> Acked-by: Paul E. McKenney <paulmck@kernel.org>

Thanks for picking it up! Please also add the following tags:

Fixes: df1e849ae455 ("rcu: Enable tick for nohz_full CPUs slow to provide expedited QS")
Acked-by: Frederic Weisbecker <frederic@kernel.org>

Thanks!
On Fri, Jan 06, 2023 at 07:01:28PM -0500, Joel Fernandes wrote:
> (lost html content)
I can't find a place where the exp grace period sends an IPI to
CPUs slow to report a QS. But anyway you really need the tick to poll
periodically on the CPU to chase a quiescent state.
Now arguably it's probably only useful when CONFIG_PREEMPT_COUNT=y
and rcu_exp_handler() has interrupted a preempt-disabled or bh-disabled
section. Although rcu_exp_handler() sets TIF_RESCHED, which is handled
by preempt_enable() and local_bh_enable() when CONFIG_PREEMPT=y.
So probably it's only useful when CONFIG_PREEMPT_COUNT=y and CONFIG_PREEMPT=n
(and there is also PREEMPT_DYNAMIC to consider).
If CONFIG_PREEMPT_COUNT=n, the tick can only report idle and user
as QS, but those are already reported explicitly on ct_kernel_exit() ->
rcu_preempt_deferred_qs().
Thanks.
> On Jan 7, 2023, at 5:11 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Fri, Jan 06, 2023 at 07:01:28PM -0500, Joel Fernandes wrote:
>> (lost html content)

My problem is the iPhone wises up when I put a web link in an email. I want to look into smtp relays but then if I spent time on fixing that, I might not get time to learn from emails like these...

> I can't find a place where the exp grace period sends an IPI to
> CPUs slow to report a QS. But anyway you really need the tick to poll
> periodically on the CPU to chase a quiescent state.

Ok.

> Now arguably it's probably only useful when CONFIG_PREEMPT_COUNT=y
> and rcu_exp_handler() has interrupted a preempt-disabled or bh-disabled
> section. Although rcu_exp_handler() sets TIF_RESCHED, which is handled
> by preempt_enable() and local_bh_enable() when CONFIG_PREEMPT=y.
> So probably it's only useful when CONFIG_PREEMPT_COUNT=y and CONFIG_PREEMPT=n
> (and there is also PREEMPT_DYNAMIC to consider).

Makes sense. I think I was missing this use case and was going by the general design of exp grace periods. I was incorrectly assuming the IPIs were being sent repeatedly for hold out CPUs, which is not the case I think. But that would another way to fix it?

But yeah I get your point, the first set of IPIs missed it, so we need the rescue-tick for long non-rcu_read_lock() implicit critical sections..

> If CONFIG_PREEMPT_COUNT=n, the tick can only report idle and user
> as QS, but those are already reported explicitly on ct_kernel_exit() ->
> rcu_preempt_deferred_qs().

Oh hmm, because that function is a NOOP for PREEMPT_COUNT=y and PREEMPT=n and will not report the deferred QS? Maybe it should then. However I think the tick is still useful if after the preempt disabled section, will still did not exit the kernel.

We ought to start another Google doc on all of this if we have not yet…

Thanks!

- Joel

>
> Thanks.
>
>
> On Jan 7, 2023, at 9:48 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
>
>
>>> On Jan 7, 2023, at 5:11 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
>>>
>>> On Fri, Jan 06, 2023 at 07:01:28PM -0500, Joel Fernandes wrote:
>>> (lost html content)
>
> My problem is the iPhone wises up when I put a web link in an email. I want to look into smtp relays but then if I spent time on fixing that, I might not get time to learn from emails like these...
>
>> I can't find a place where the exp grace period sends an IPI to
>> CPUs slow to report a QS. But anyway you really need the tick to poll
>> periodically on the CPU to chase a quiescent state.
>
> Ok.
>
>> Now arguably it's probably only useful when CONFIG_PREEMPT_COUNT=y
>> and rcu_exp_handler() has interrupted a preempt-disabled or bh-disabled
>> section. Although rcu_exp_handler() sets TIF_RESCHED, which is handled
>> by preempt_enable() and local_bh_enable() when CONFIG_PREEMPT=y.
>> So probably it's only useful when CONFIG_PREEMPT_COUNT=y and CONFIG_PREEMPT=n
>> (and there is also PREEMPT_DYNAMIC to consider).
>
> Makes sense. I think I was missing this use case and was going by the general design of exp grace periods. I was incorrectly assuming the IPIs were being sent repeatedly for hold out CPUs, which is not the case I think. But that would another way to fix it?
>
> But yeah I get your point, the first set of IPIs missed it, so we need the rescue-tick for long non-rcu_read_lock() implicit critical sections..
>
>> If CONFIG_PREEMPT_COUNT=n, the tick can only report idle and user
>> as QS, but those are already reported explicitly on ct_kernel_exit() ->
>> rcu_preempt_deferred_qs().
>
> Oh hmm, because that function is a NOOP for PREEMPT_COUNT=y and PREEMPT=n and will not report the deferred QS? Maybe it should then. However I think the tick is still useful if after the preempt disabled section, will still did not exit the kernel.

I think meant I here, an atomic section (like bh or Irq disabled). There is no such thing as disabling preemption for CONFIG_PREEMPT=n. Or maybe I am confused again. This RCU thing…

Thanks.

>
> We ought to start another Google doc on all of this if we have not yet…
>
> Thanks!
>
> - Joel
>
>>
>> Thanks.
>>
>>
On Sat, Jan 07, 2023 at 09:55:22PM -0500, Joel Fernandes wrote:
>
>
> > On Jan 7, 2023, at 9:48 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> >
> >>> On Jan 7, 2023, at 5:11 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> >>>
> >>> On Fri, Jan 06, 2023 at 07:01:28PM -0500, Joel Fernandes wrote:
> >>> (lost html content)
> >
> > My problem is the iPhone wises up when I put a web link in an email. I want to look into smtp relays but then if I spent time on fixing that, I might not get time to learn from emails like these...
> >
> >> I can't find a place where the exp grace period sends an IPI to
> >> CPUs slow to report a QS. But anyway you really need the tick to poll
> >> periodically on the CPU to chase a quiescent state.
> >
> > Ok.
> >
> >> Now arguably it's probably only useful when CONFIG_PREEMPT_COUNT=y
> >> and rcu_exp_handler() has interrupted a preempt-disabled or bh-disabled
> >> section. Although rcu_exp_handler() sets TIF_RESCHED, which is handled
> >> by preempt_enable() and local_bh_enable() when CONFIG_PREEMPT=y.
> >> So probably it's only useful when CONFIG_PREEMPT_COUNT=y and CONFIG_PREEMPT=n
> >> (and there is also PREEMPT_DYNAMIC to consider).
> >
> > Makes sense. I think I was missing this use case and was going by the general design of exp grace periods. I was incorrectly assuming the IPIs were being sent repeatedly for hold out CPUs, which is not the case I think. But that would another way to fix it?
> >
> > But yeah I get your point, the first set of IPIs missed it, so we need the rescue-tick for long non-rcu_read_lock() implicit critical sections..
> >
> >> If CONFIG_PREEMPT_COUNT=n, the tick can only report idle and user
> >> as QS, but those are already reported explicitly on ct_kernel_exit() ->
> >> rcu_preempt_deferred_qs().
> >
> > Oh hmm, because that function is a NOOP for PREEMPT_COUNT=y and PREEMPT=n and will not report the deferred QS? Maybe it should then. However I think the tick is still useful if after the preempt disabled section, will still did not exit the kernel.
>
> I think meant I here, an atomic section (like bh or Irq disabled). There is no such thing as disabling preemption for CONFIG_PREEMPT=n. Or maybe I am confused again. This RCU thing…

Right, so when CONFIG_PREEMPT_COUNT=n, there is no way for a tick to tell if the
the interrupted code is safely considered as a QS. That's because
preempt_disable() <-> preempt_enable() are no-ops so the whole kernel is
assumed non-preemptible, and therefore the whole kernel is a READ side critical
section, except for the explicit points reporting a QS.

The only exception is when the tick interrupts idle (or user with
nohz_full). But we already have an exp QS reported on idle (and user with
nohz_full) entry through ct_kernel_exit(), and that happens on all RCU_TREE
configs (PREEMPT or not). Therefore the tick doesn't appear to be helpful at
all on a nohz_full CPU with CONFIG_PREEMPT_COUNT=n.

I suggest we don't bother optimizing that case though...

To summarize:

1) nohz_full && !CONFIG_PREEMPT_COUNT && !CONFIG_PREEMPT_RCU:
   Tick isn't helpful. It can only report idle/user QS, but that is
   already reported explicitly.

2) nohz_full && CONFIG_PREEMPT_COUNT && !CONFIG_PREEMPT_RCU:
   Tick is very helpful because it can tell if the kernel is in
   a QS state.

3) nohz_full && CONFIG_PREEMPT_RCU:
   Tick doesn't appear to be helpful because:
   - If the rcu_exp_handler() fires in an rcu_read_lock'ed section, then the
     exp QS is reported on rcu_read_unlock()
   - If the rcu_exp_handler() fires in a preempt/bh disabled section,
     TIF_RESCHED is forced which is handled on preempt/bh re-enablement,
     reporting a QS.

The case 2) is a niche, only useful for debugging. But anyway I'm not sure it's
worth changing/optimizing the current state. Might be worth add a comment
though.

Thanks.
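[Editor's note] As a side note on case 2) above: with CONFIG_PREEMPT_COUNT=y but non-preemptible RCU, the scheduling-clock interrupt can inspect the interrupted context's preempt count and report a quiescent state when neither preemption nor softirqs are disabled. The following stand-alone sketch only models that reasoning; the mask layout and function name are invented for illustration and are not the kernel's actual tick handler.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative preempt_count layout: low byte preempt depth, next byte softirq depth. */
#define PREEMPT_MASK	0x000000ffu
#define SOFTIRQ_MASK	0x0000ff00u

/*
 * Model of the tick-side decision for non-preemptible RCU with
 * CONFIG_PREEMPT_COUNT=y: if the interrupted code holds no preempt or
 * softirq disables, it cannot be inside a read-side critical section,
 * so the tick may report a quiescent state on its behalf.
 */
static bool tick_sees_quiescent_state(unsigned int preempt_count,
				      bool from_idle, bool from_user)
{
	if (from_idle || from_user)
		return true;	/* idle and userspace are always quiescent */
	return !(preempt_count & (PREEMPT_MASK | SOFTIRQ_MASK));
}

int main(void)
{
	/* Tick lands in plain kernel code: the tick itself can report the QS. */
	printf("plain kernel code: %s\n",
	       tick_sees_quiescent_state(0x0, false, false) ? "QS reported" : "still holding out");

	/* Tick lands inside a preempt-disabled region: keep polling. */
	printf("preempt disabled:  %s\n",
	       tick_sees_quiescent_state(0x1, false, false) ? "QS reported" : "still holding out");
	return 0;
}

With CONFIG_PREEMPT_COUNT=n the count above is simply not maintained, so outside of idle and userspace the tick has nothing to inspect, which is case 1) above.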
On Sat, Jan 07, 2023 at 10:39:22PM +0100, Frederic Weisbecker wrote:
> On Wed, Dec 21, 2022 at 12:16:34PM -0800, Paul E. McKenney wrote:
> > On Tue, Dec 20, 2022 at 10:16:25PM +0800, Zqiang wrote:
> > > This commit add TICK_DEP_MASK_RCU_EXP dependency check in
> > > check_tick_dependency(), fix commit df1e849ae4559 ("rcu: Enable
> > > tick for nohz_full CPUs slow to provide expedited QS").
> > >
> > > Signed-off-by: Zqiang <qiang1.zhang@intel.com>
> >
> > Again, good eyes, thank you!!!
> >
> > I have queued this for further review and testing, given that it affects
> > pretty much only RCU.  However, if someone else would rather take it:
> >
> > Acked-by: Paul E. McKenney <paulmck@kernel.org>
>
> Thanks for picking it up! Please also add the following tags:
>
> Fixes: df1e849ae455 ("rcu: Enable tick for nohz_full CPUs slow to provide expedited QS")
> Acked-by: Frederic Weisbecker <frederic@kernel.org>

Thank you for looking it over!  I will apply these on my next rebase.

							Thanx, Paul
On Mon, Jan 09, 2023 at 12:09:48AM +0100, Frederic Weisbecker wrote:
> On Sat, Jan 07, 2023 at 09:55:22PM -0500, Joel Fernandes wrote:
> >
> >
> > > On Jan 7, 2023, at 9:48 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > >
> > >>> On Jan 7, 2023, at 5:11 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> > >>>
> > >>> On Fri, Jan 06, 2023 at 07:01:28PM -0500, Joel Fernandes wrote:
> > >>> (lost html content)
> > >
> > > My problem is the iPhone wises up when I put a web link in an email. I want to look into smtp relays but then if I spent time on fixing that, I might not get time to learn from emails like these...
> > >
> > >> I can't find a place where the exp grace period sends an IPI to
> > >> CPUs slow to report a QS. But anyway you really need the tick to poll
> > >> periodically on the CPU to chase a quiescent state.
> > >
> > > Ok.
> > >
> > >> Now arguably it's probably only useful when CONFIG_PREEMPT_COUNT=y
> > >> and rcu_exp_handler() has interrupted a preempt-disabled or bh-disabled
> > >> section. Although rcu_exp_handler() sets TIF_RESCHED, which is handled
> > >> by preempt_enable() and local_bh_enable() when CONFIG_PREEMPT=y.
> > >> So probably it's only useful when CONFIG_PREEMPT_COUNT=y and CONFIG_PREEMPT=n
> > >> (and there is also PREEMPT_DYNAMIC to consider).
> > >
> > > Makes sense. I think I was missing this use case and was going by the general design of exp grace periods. I was incorrectly assuming the IPIs were being sent repeatedly for hold out CPUs, which is not the case I think. But that would another way to fix it?
> > >
> > > But yeah I get your point, the first set of IPIs missed it, so we need the rescue-tick for long non-rcu_read_lock() implicit critical sections..
> > >
> > >> If CONFIG_PREEMPT_COUNT=n, the tick can only report idle and user
> > >> as QS, but those are already reported explicitly on ct_kernel_exit() ->
> > >> rcu_preempt_deferred_qs().
> > >
> > > Oh hmm, because that function is a NOOP for PREEMPT_COUNT=y and PREEMPT=n and will not report the deferred QS? Maybe it should then. However I think the tick is still useful if after the preempt disabled section, will still did not exit the kernel.
> >
> > I think meant I here, an atomic section (like bh or Irq disabled). There is no such thing as disabling preemption for CONFIG_PREEMPT=n. Or maybe I am confused again. This RCU thing…
>
> Right, so when CONFIG_PREEMPT_COUNT=n, there is no way for a tick to tell if the
> the interrupted code is safely considered as a QS. That's because
> preempt_disable() <-> preempt_enable() are no-ops so the whole kernel is
> assumed non-preemptible, and therefore the whole kernel is a READ side critical
> section, except for the explicit points reporting a QS.
>
> The only exception is when the tick interrupts idle (or user with
> nohz_full). But we already have an exp QS reported on idle (and user with
> nohz_full) entry through ct_kernel_exit(), and that happens on all RCU_TREE
> configs (PREEMPT or not). Therefore the tick doesn't appear to be helpful at
> all on a nohz_full CPU with CONFIG_PREEMPT_COUNT=n.
>
> I suggest we don't bother optimizing that case though...
>
> To summarize:
>
> 1) nohz_full && !CONFIG_PREEMPT_COUNT && !CONFIG_PREEMPT_RCU:
>    Tick isn't helpful. It can only report idle/user QS, but that is
>    already reported explicitly.
>
> 2) nohz_full && CONFIG_PREEMPT_COUNT && !CONFIG_PREEMPT_RCU:
>    Tick is very helpful because it can tell if the kernel is in
>    a QS state.
>
> 3) nohz_full && CONFIG_PREEMPT_RCU:
>    Tick doesn't appear to be helpful because:
>    - If the rcu_exp_handler() fires in an rcu_read_lock'ed section, then the
>      exp QS is reported on rcu_read_unlock()
>    - If the rcu_exp_handler() fires in a preempt/bh disabled section,
>      TIF_RESCHED is forced which is handled on preempt/bh re-enablement,
>      reporting a QS.
>
> The case 2) is a niche, only useful for debugging. But anyway I'm not sure it's
> worth changing/optimizing the current state. Might be worth add a comment
> though.

Thank you both for the analysis!  I would welcome a comment.

One could argue that we should increase the delay before turning the
tick on, but my experience is that expedited grace periods almost always
complete in less than a jiffy, so there would almost never be any benefit
in doing so.  But if some large NO_HZ_FULL system with long RCU readers
starts having trouble with too-frequent tick enablement, that is one
possible fix.

							Thanx, Paul
On Mon, Jan 09, 2023 at 11:32:26AM -0800, Paul E. McKenney wrote:
> On Mon, Jan 09, 2023 at 12:09:48AM +0100, Frederic Weisbecker wrote:
> > On Sat, Jan 07, 2023 at 09:55:22PM -0500, Joel Fernandes wrote:
> > >
> > >
> > > > On Jan 7, 2023, at 9:48 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >
> > > >
> > > >>> On Jan 7, 2023, at 5:11 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> > > >>>
> > > >>> On Fri, Jan 06, 2023 at 07:01:28PM -0500, Joel Fernandes wrote:
> > > >>> (lost html content)
> > > >
> > > > My problem is the iPhone wises up when I put a web link in an email. I want to look into smtp relays but then if I spent time on fixing that, I might not get time to learn from emails like these...
> > > >
> > > >> I can't find a place where the exp grace period sends an IPI to
> > > >> CPUs slow to report a QS. But anyway you really need the tick to poll
> > > >> periodically on the CPU to chase a quiescent state.
> > > >
> > > > Ok.
> > > >
> > > >> Now arguably it's probably only useful when CONFIG_PREEMPT_COUNT=y
> > > >> and rcu_exp_handler() has interrupted a preempt-disabled or bh-disabled
> > > >> section. Although rcu_exp_handler() sets TIF_RESCHED, which is handled
> > > >> by preempt_enable() and local_bh_enable() when CONFIG_PREEMPT=y.
> > > >> So probably it's only useful when CONFIG_PREEMPT_COUNT=y and CONFIG_PREEMPT=n
> > > >> (and there is also PREEMPT_DYNAMIC to consider).
> > > >
> > > > Makes sense. I think I was missing this use case and was going by the general design of exp grace periods. I was incorrectly assuming the IPIs were being sent repeatedly for hold out CPUs, which is not the case I think. But that would another way to fix it?
> > > >
> > > > But yeah I get your point, the first set of IPIs missed it, so we need the rescue-tick for long non-rcu_read_lock() implicit critical sections..
> > > >
> > > >> If CONFIG_PREEMPT_COUNT=n, the tick can only report idle and user
> > > >> as QS, but those are already reported explicitly on ct_kernel_exit() ->
> > > >> rcu_preempt_deferred_qs().
> > > >
> > > > Oh hmm, because that function is a NOOP for PREEMPT_COUNT=y and PREEMPT=n and will not report the deferred QS? Maybe it should then. However I think the tick is still useful if after the preempt disabled section, will still did not exit the kernel.
> > >
> > > I think meant I here, an atomic section (like bh or Irq disabled). There is no such thing as disabling preemption for CONFIG_PREEMPT=n. Or maybe I am confused again. This RCU thing…
> >
> > Right, so when CONFIG_PREEMPT_COUNT=n, there is no way for a tick to tell if the
> > the interrupted code is safely considered as a QS. That's because
> > preempt_disable() <-> preempt_enable() are no-ops so the whole kernel is
> > assumed non-preemptible, and therefore the whole kernel is a READ side critical
> > section, except for the explicit points reporting a QS.
> >
> > The only exception is when the tick interrupts idle (or user with
> > nohz_full). But we already have an exp QS reported on idle (and user with
> > nohz_full) entry through ct_kernel_exit(), and that happens on all RCU_TREE
> > configs (PREEMPT or not). Therefore the tick doesn't appear to be helpful at
> > all on a nohz_full CPU with CONFIG_PREEMPT_COUNT=n.
> >
> > I suggest we don't bother optimizing that case though...
> >
> > To summarize:
> >
> > 1) nohz_full && !CONFIG_PREEMPT_COUNT && !CONFIG_PREEMPT_RCU:
> >    Tick isn't helpful. It can only report idle/user QS, but that is
> >    already reported explicitly.
> >
> > 2) nohz_full && CONFIG_PREEMPT_COUNT && !CONFIG_PREEMPT_RCU:
> >    Tick is very helpful because it can tell if the kernel is in
> >    a QS state.
> >
> > 3) nohz_full && CONFIG_PREEMPT_RCU:
> >    Tick doesn't appear to be helpful because:
> >    - If the rcu_exp_handler() fires in an rcu_read_lock'ed section, then the
> >      exp QS is reported on rcu_read_unlock()
> >    - If the rcu_exp_handler() fires in a preempt/bh disabled section,
> >      TIF_RESCHED is forced which is handled on preempt/bh re-enablement,
> >      reporting a QS.
> >
> > The case 2) is a niche, only useful for debugging. But anyway I'm not sure it's
> > worth changing/optimizing the current state. Might be worth add a comment
> > though.
>
> Thank you both for the analysis!  I would welcome a comment.

I'm preparing that.

> One could argue that we should increase the delay before turning the
> tick on, but my experience is that expedited grace periods almost always
> complete in less than a jiffy, so there would almost never be any benefit
> in doing so.  But if some large NO_HZ_FULL system with long RCU readers
> starts having trouble with too-frequent tick enablement, that is one
> possible fix.

And last but not least: wait for anybody to complain before changing anything
;-))

Thanks.
On Tue, Jan 10, 2023 at 12:55:51AM +0100, Frederic Weisbecker wrote:
> On Mon, Jan 09, 2023 at 11:32:26AM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 09, 2023 at 12:09:48AM +0100, Frederic Weisbecker wrote:
> > > On Sat, Jan 07, 2023 at 09:55:22PM -0500, Joel Fernandes wrote:
> > > > > On Jan 7, 2023, at 9:48 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > >
> > > > >
> > > > >>> On Jan 7, 2023, at 5:11 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > >>>
> > > > >>> On Fri, Jan 06, 2023 at 07:01:28PM -0500, Joel Fernandes wrote:
> > > > >>> (lost html content)
> > > > >
> > > > > My problem is the iPhone wises up when I put a web link in an email. I want to look into smtp relays but then if I spent time on fixing that, I might not get time to learn from emails like these...
> > > > >
> > > > >> I can't find a place where the exp grace period sends an IPI to
> > > > >> CPUs slow to report a QS. But anyway you really need the tick to poll
> > > > >> periodically on the CPU to chase a quiescent state.
> > > > >
> > > > > Ok.
> > > > >
> > > > >> Now arguably it's probably only useful when CONFIG_PREEMPT_COUNT=y
> > > > >> and rcu_exp_handler() has interrupted a preempt-disabled or bh-disabled
> > > > >> section. Although rcu_exp_handler() sets TIF_RESCHED, which is handled
> > > > >> by preempt_enable() and local_bh_enable() when CONFIG_PREEMPT=y.
> > > > >> So probably it's only useful when CONFIG_PREEMPT_COUNT=y and CONFIG_PREEMPT=n
> > > > >> (and there is also PREEMPT_DYNAMIC to consider).
> > > > >
> > > > > Makes sense. I think I was missing this use case and was going by the general design of exp grace periods. I was incorrectly assuming the IPIs were being sent repeatedly for hold out CPUs, which is not the case I think. But that would another way to fix it?
> > > > >
> > > > > But yeah I get your point, the first set of IPIs missed it, so we need the rescue-tick for long non-rcu_read_lock() implicit critical sections..
> > > > >
> > > > >> If CONFIG_PREEMPT_COUNT=n, the tick can only report idle and user
> > > > >> as QS, but those are already reported explicitly on ct_kernel_exit() ->
> > > > >> rcu_preempt_deferred_qs().
> > > > >
> > > > > Oh hmm, because that function is a NOOP for PREEMPT_COUNT=y and PREEMPT=n and will not report the deferred QS? Maybe it should then. However I think the tick is still useful if after the preempt disabled section, will still did not exit the kernel.
> > > >
> > > > I think meant I here, an atomic section (like bh or Irq disabled). There is no such thing as disabling preemption for CONFIG_PREEMPT=n. Or maybe I am confused again. This RCU thing…
> > >
> > > Right, so when CONFIG_PREEMPT_COUNT=n, there is no way for a tick to tell if the
> > > the interrupted code is safely considered as a QS. That's because
> > > preempt_disable() <-> preempt_enable() are no-ops so the whole kernel is
> > > assumed non-preemptible, and therefore the whole kernel is a READ side critical
> > > section, except for the explicit points reporting a QS.
> > >
> > > The only exception is when the tick interrupts idle (or user with
> > > nohz_full). But we already have an exp QS reported on idle (and user with
> > > nohz_full) entry through ct_kernel_exit(), and that happens on all RCU_TREE
> > > configs (PREEMPT or not). Therefore the tick doesn't appear to be helpful at
> > > all on a nohz_full CPU with CONFIG_PREEMPT_COUNT=n.
> > >
> > > I suggest we don't bother optimizing that case though...
> > >
> > > To summarize:
> > >
> > > 1) nohz_full && !CONFIG_PREEMPT_COUNT && !CONFIG_PREEMPT_RCU:
> > >    Tick isn't helpful. It can only report idle/user QS, but that is
> > >    already reported explicitly.
> > >
> > > 2) nohz_full && CONFIG_PREEMPT_COUNT && !CONFIG_PREEMPT_RCU:
> > >    Tick is very helpful because it can tell if the kernel is in
> > >    a QS state.
> > >
> > > 3) nohz_full && CONFIG_PREEMPT_RCU:
> > >    Tick doesn't appear to be helpful because:
> > >    - If the rcu_exp_handler() fires in an rcu_read_lock'ed section, then the
> > >      exp QS is reported on rcu_read_unlock()
> > >    - If the rcu_exp_handler() fires in a preempt/bh disabled section,
> > >      TIF_RESCHED is forced which is handled on preempt/bh re-enablement,
> > >      reporting a QS.
> > >
> > > The case 2) is a niche, only useful for debugging. But anyway I'm not sure it's
> > > worth changing/optimizing the current state. Might be worth add a comment
> > > though.
> >
> > Thank you both for the analysis!  I would welcome a comment.
>
> I'm preparing that.
>
> > One could argue that we should increase the delay before turning the
> > tick on, but my experience is that expedited grace periods almost always
> > complete in less than a jiffy, so there would almost never be any benefit
> > in doing so.  But if some large NO_HZ_FULL system with long RCU readers
> > starts having trouble with too-frequent tick enablement, that is one
> > possible fix.
>
> And last but not least: wait for anybody to complain before changing anything
> ;-))

Well said!  Up to a point, anyway.  ;-)

							Thanx, Paul
diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index 2e713a7d9aa3..3e8619c72f77 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -371,7 +371,8 @@ TRACE_EVENT(itimer_expire,
 		tick_dep_name(PERF_EVENTS)		\
 		tick_dep_name(SCHED)			\
 		tick_dep_name(CLOCK_UNSTABLE)		\
-		tick_dep_name_end(RCU)
+		tick_dep_name(RCU)			\
+		tick_dep_name_end(RCU_EXP)
 
 #undef tick_dep_name
 #undef tick_dep_mask_name
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b0e3c9205946..ba2ac1469d47 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -281,6 +281,11 @@ static bool check_tick_dependency(atomic_t *dep)
 		return true;
 	}
 
+	if (val & TICK_DEP_MASK_RCU_EXP) {
+		trace_tick_stop(0, TICK_DEP_MASK_RCU_EXP);
+		return true;
+	}
+
 	return false;
 }
 
This commit add TICK_DEP_MASK_RCU_EXP dependency check in
check_tick_dependency(), fix commit df1e849ae4559 ("rcu: Enable
tick for nohz_full CPUs slow to provide expedited QS").

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
---
 include/trace/events/timer.h | 3 ++-
 kernel/time/tick-sched.c     | 5 +++++
 2 files changed, 7 insertions(+), 1 deletion(-)