
rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check

Message ID: 20221220141625.3612085-1-qiang1.zhang@intel.com (mailing list archive)
State: Accepted
Commit: f22caef6cda5ed19a55ec2e703f60f1fa85e52bc
Series: rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check

Commit Message

Zqiang Dec. 20, 2022, 2:16 p.m. UTC
This commit adds a TICK_DEP_MASK_RCU_EXP dependency check in
check_tick_dependency(), fixing commit df1e849ae4559 ("rcu: Enable
tick for nohz_full CPUs slow to provide expedited QS").

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
---
 include/trace/events/timer.h | 3 ++-
 kernel/time/tick-sched.c     | 5 +++++
 2 files changed, 7 insertions(+), 1 deletion(-)
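
For context, a simplified sketch of the tick dependency bits and masks (modelled
on include/linux/tick.h; not the literal upstream definitions): every subsystem
that needs the scheduling-clock tick owns one TICK_DEP_BIT_*, each bit has a
corresponding TICK_DEP_MASK_*, and check_tick_dependency() is expected to test
those masks before the tick may be stopped. The RCU_EXP bit was introduced by
the commit named above, but its mask was never tested, which is what this patch
fixes.

/* Simplified sketch of the tick dependency bits and masks. */
enum tick_dep_bits {
	TICK_DEP_BIT_POSIX_TIMER	= 0,
	TICK_DEP_BIT_PERF_EVENTS	= 1,
	TICK_DEP_BIT_SCHED		= 2,
	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3,
	TICK_DEP_BIT_RCU		= 4,
	TICK_DEP_BIT_RCU_EXP		= 5	/* Added by df1e849ae455. */
};

#define TICK_DEP_MASK_RCU	(1 << TICK_DEP_BIT_RCU)
#define TICK_DEP_MASK_RCU_EXP	(1 << TICK_DEP_BIT_RCU_EXP)

/*
 * RCU sets TICK_DEP_BIT_RCU_EXP on holdout nohz_full CPUs, but before this
 * patch check_tick_dependency() never looked at TICK_DEP_MASK_RCU_EXP, so
 * the tick could still be stopped while an expedited grace period waited.
 */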

Comments

Paul E. McKenney Dec. 21, 2022, 8:16 p.m. UTC | #1
On Tue, Dec 20, 2022 at 10:16:25PM +0800, Zqiang wrote:
> This commit add TICK_DEP_MASK_RCU_EXP dependency check in
> check_tick_dependency(), fix commit df1e849ae4559 ("rcu: Enable
> tick for nohz_full CPUs slow to provide expedited QS").
> 
> Signed-off-by: Zqiang <qiang1.zhang@intel.com>

Again, good eyes, thank you!!!

I have queued this for further review and testing, given that it affects
pretty much only RCU.  However, if someone else would rather take it:

Acked-by: Paul E. McKenney <paulmck@kernel.org>

							Thanx, Paul

------------------------------------------------------------------------

commit f22caef6cda5ed19a55ec2e703f60f1fa85e52bc
Author: Zqiang <qiang1.zhang@intel.com>
Date:   Tue Dec 20 22:16:25 2022 +0800

    rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
    
    This commit adds checks for the TICK_DEP_MASK_RCU_EXP bit, thus enabling
    RCU expedited grace periods to actually force-enable scheduling-clock
    interrupts on holdout CPUs.
    
    Fixes: df1e849ae455 ("rcu: Enable tick for nohz_full CPUs slow to provide expedited QS")
    Signed-off-by: Zqiang <qiang1.zhang@intel.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Masami Hiramatsu <mhiramat@kernel.org>
    Cc: Frederic Weisbecker <fweisbec@gmail.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index 2e713a7d9aa3a..3e8619c72f774 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -371,7 +371,8 @@ TRACE_EVENT(itimer_expire,
 		tick_dep_name(PERF_EVENTS)		\
 		tick_dep_name(SCHED)			\
 		tick_dep_name(CLOCK_UNSTABLE)		\
-		tick_dep_name_end(RCU)
+		tick_dep_name(RCU)			\
+		tick_dep_name_end(RCU_EXP)
 
 #undef tick_dep_name
 #undef tick_dep_mask_name
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b0e3c9205946f..ba2ac1469d473 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -281,6 +281,11 @@ static bool check_tick_dependency(atomic_t *dep)
 		return true;
 	}
 
+	if (val & TICK_DEP_MASK_RCU_EXP) {
+		trace_tick_stop(0, TICK_DEP_MASK_RCU_EXP);
+		return true;
+	}
+
 	return false;
 }
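
For reference, a rough sketch of the setter side: how the expedited grace period
asks for the tick on a nohz_full CPU that is slow to report a quiescent state.
The tick_dep_set_cpu()/tick_dep_clear_cpu() calls are the real API; the helper
name and the surrounding loop are condensed for illustration and are not the
literal kernel/rcu/tree_exp.h code.

/* Illustrative helper (hypothetical name); details of the real code omitted. */
static void rcu_exp_enable_tick_on_holdouts(const struct cpumask *holdouts)
{
	int cpu;

	for_each_cpu(cpu, holdouts) {
		/* Only nohz_full CPUs can have their tick stopped. */
		if (!tick_nohz_full_cpu(cpu))
			continue;
		/*
		 * Request that the tick keep running (or be restarted) on
		 * this CPU until the expedited grace period completes.
		 * Without the TICK_DEP_MASK_RCU_EXP check added above, this
		 * request was ignored when deciding whether to stop the tick.
		 */
		tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP);
	}
	/* The dependency is dropped via tick_dep_clear_cpu() once the GP ends. */
}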
Frederic Weisbecker Jan. 7, 2023, 9:39 p.m. UTC | #2
On Wed, Dec 21, 2022 at 12:16:34PM -0800, Paul E. McKenney wrote:
> On Tue, Dec 20, 2022 at 10:16:25PM +0800, Zqiang wrote:
> > This commit add TICK_DEP_MASK_RCU_EXP dependency check in
> > check_tick_dependency(), fix commit df1e849ae4559 ("rcu: Enable
> > tick for nohz_full CPUs slow to provide expedited QS").
> > 
> > Signed-off-by: Zqiang <qiang1.zhang@intel.com>
> 
> Again, good eyes, thank you!!!
> 
> I have queued this for further review and testing, given that it affects
> pretty much only RCU.  However, if someone else would rather take it:
> 
> Acked-by: Paul E. McKenney <paulmck@kernel.org>


Thanks for picking it up! Please also add the following tags:

Fixes: df1e849ae455 ("rcu: Enable tick for nohz_full CPUs slow to provide expedited QS")
Acked-by: Frederic Weisbecker <frederic@kernel.org>

Thanks!
Frederic Weisbecker Jan. 7, 2023, 10:11 p.m. UTC | #3
On Fri, Jan 06, 2023 at 07:01:28PM -0500, Joel Fernandes wrote:
> (lost html content)

I can't find a place where the exp grace period sends an IPI to
CPUs slow to report a QS. But anyway you really need the tick to poll
periodically on the CPU to chase a quiescent state.

Now arguably it's probably only useful when CONFIG_PREEMPT_COUNT=y
and rcu_exp_handler() has interrupted a preempt-disabled or bh-disabled
section. Although rcu_exp_handler() sets TIF_RESCHED, which is handled
by preempt_enable() and local_bh_enable() when CONFIG_PREEMPT=y.
So probably it's only useful when CONFIG_PREEMPT_COUNT=y and CONFIG_PREEMPT=n
(and there is also PREEMPT_DYNAMIC to consider).

If CONFIG_PREEMPT_COUNT=n, the tick can only report idle and user
as QS, but those are already reported explicitly on ct_kernel_exit() ->
rcu_preempt_deferred_qs().

Thanks.
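
A rough sketch of the decision Frederic is describing, loosely modelled on the
non-preemptible (CONFIG_PREEMPT_RCU=n) variant of rcu_exp_handler(); the helper
name is hypothetical and the details are approximate rather than the literal
kernel/rcu/tree_exp.h code:

static void rcu_exp_handler_sketch(void *unused)
{
	/*
	 * Interrupted idle, or preempt/bh-enabled code with a usable
	 * preempt counter: this is already a quiescent state, report it.
	 */
	if (rcu_is_cpu_rrupt_from_idle() ||
	    (IS_ENABLED(CONFIG_PREEMPT_COUNT) &&
	     !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {
		rcu_report_exp_rdp(this_cpu_ptr(&rcu_data));
		return;
	}
	/*
	 * Otherwise defer: force a reschedule so the QS is reported from
	 * preempt_enable()/local_bh_enable() when CONFIG_PREEMPT=y, or is
	 * chased by the scheduling-clock tick otherwise, which is where
	 * TICK_DEP_BIT_RCU_EXP comes in.
	 */
	set_tsk_need_resched(current);
	set_preempt_need_resched();
}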
Joel Fernandes Jan. 8, 2023, 2:48 a.m. UTC | #4
> On Jan 7, 2023, at 5:11 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> 
> On Fri, Jan 06, 2023 at 07:01:28PM -0500, Joel Fernandes wrote:
>> (lost html content)

My problem is that the iPhone wises up when I put a web link in an email. I want to look into SMTP relays, but if I spend time fixing that, I might not get time to learn from emails like these... 

> I can't find a place where the exp grace period sends an IPI to
> CPUs slow to report a QS. But anyway you really need the tick to poll
> periodically on the CPU to chase a quiescent state.

Ok.

> Now arguably it's probably only useful when CONFIG_PREEMPT_COUNT=y
> and rcu_exp_handler() has interrupted a preempt-disabled or bh-disabled
> section. Although rcu_exp_handler() sets TIF_RESCHED, which is handled
> by preempt_enable() and local_bh_enable() when CONFIG_PREEMPT=y.
> So probably it's only useful when CONFIG_PREEMPT_COUNT=y and CONFIG_PREEMPT=n
> (and there is also PREEMPT_DYNAMIC to consider).

Makes sense. I think I was missing this use case and was going by the general design of exp grace periods.  I was incorrectly assuming the IPIs were being sent repeatedly for holdout CPUs, which I think is not the case. But that would be another way to fix it?

But yeah, I get your point: the first set of IPIs missed it, so we need the rescue tick for long non-rcu_read_lock() implicit critical sections.

> If CONFIG_PREEMPT_COUNT=n, the tick can only report idle and user
> as QS, but those are already reported explicitly on ct_kernel_exit() ->
> rcu_preempt_deferred_qs().

Oh hmm, because that function is a NOOP for PREEMPT_COUNT=y and PREEMPT=n and will not report the deferred QS?  Maybe it should then. However, I think the tick is still useful if, after the preempt-disabled section, we still did not exit the kernel.

We ought to start another Google doc on all of this if we have not yet…

Thanks!

 - Joel

> 
> Thanks.
> 
>
Joel Fernandes Jan. 8, 2023, 2:55 a.m. UTC | #5
> On Jan 7, 2023, at 9:48 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> 
>>> On Jan 7, 2023, at 5:11 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
>>> 
>>> On Fri, Jan 06, 2023 at 07:01:28PM -0500, Joel Fernandes wrote:
>>> (lost html content)
> 
> My problem is the iPhone wises up when I put a web link in an email. I want to look into smtp relays but then if I spent time on fixing that, I might not get time to learn from emails like these... 
> 
>> I can't find a place where the exp grace period sends an IPI to
>> CPUs slow to report a QS. But anyway you really need the tick to poll
>> periodically on the CPU to chase a quiescent state.
> 
> Ok.
> 
>> Now arguably it's probably only useful when CONFIG_PREEMPT_COUNT=y
>> and rcu_exp_handler() has interrupted a preempt-disabled or bh-disabled
>> section. Although rcu_exp_handler() sets TIF_RESCHED, which is handled
>> by preempt_enable() and local_bh_enable() when CONFIG_PREEMPT=y.
>> So probably it's only useful when CONFIG_PREEMPT_COUNT=y and CONFIG_PREEMPT=n
>> (and there is also PREEMPT_DYNAMIC to consider).
> 
> Makes sense. I think I was missing this use case and was going by the general design of exp grace periods.  I was incorrectly assuming the IPIs were being sent repeatedly for hold out CPUs, which is not the case I think. But that would another way to fix it?
> 
> But yeah I get your point, the first set of IPIs missed it, so we need the rescue-tick for long non-rcu_read_lock() implicit critical sections.. 
> 
>> If CONFIG_PREEMPT_COUNT=n, the tick can only report idle and user
>> as QS, but those are already reported explicitly on ct_kernel_exit() ->
>> rcu_preempt_deferred_qs().
> 
> Oh hmm, because that function is a NOOP for PREEMPT_COUNT=y and PREEMPT=n and will not report the deferred QS?  Maybe it should then. However I think the tick is still useful if after the preempt disabled section, will still did not exit the kernel.

I think I meant here an atomic section (like bh or IRQ disabled). There is no such thing as disabling preemption for CONFIG_PREEMPT=n. Or maybe I am confused again.  This RCU thing…

Thanks.


> 
> We ought to start another Google doc on all of this if we have not yet…
> 
> Thanks!
> 
> - Joel
> 
>> 
>> Thanks.
>> 
>>
Frederic Weisbecker Jan. 8, 2023, 11:09 p.m. UTC | #6
On Sat, Jan 07, 2023 at 09:55:22PM -0500, Joel Fernandes wrote:
> 
> 
> > On Jan 7, 2023, at 9:48 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> > 
> > 
> >>> On Jan 7, 2023, at 5:11 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> >>> 
> >>> On Fri, Jan 06, 2023 at 07:01:28PM -0500, Joel Fernandes wrote:
> >>> (lost html content)
> > 
> > My problem is the iPhone wises up when I put a web link in an email. I want to look into smtp relays but then if I spent time on fixing that, I might not get time to learn from emails like these... 
> > 
> >> I can't find a place where the exp grace period sends an IPI to
> >> CPUs slow to report a QS. But anyway you really need the tick to poll
> >> periodically on the CPU to chase a quiescent state.
> > 
> > Ok.
> > 
> >> Now arguably it's probably only useful when CONFIG_PREEMPT_COUNT=y
> >> and rcu_exp_handler() has interrupted a preempt-disabled or bh-disabled
> >> section. Although rcu_exp_handler() sets TIF_RESCHED, which is handled
> >> by preempt_enable() and local_bh_enable() when CONFIG_PREEMPT=y.
> >> So probably it's only useful when CONFIG_PREEMPT_COUNT=y and CONFIG_PREEMPT=n
> >> (and there is also PREEMPT_DYNAMIC to consider).
> > 
> > Makes sense. I think I was missing this use case and was going by the general design of exp grace periods.  I was incorrectly assuming the IPIs were being sent repeatedly for hold out CPUs, which is not the case I think. But that would another way to fix it?
> > 
> > But yeah I get your point, the first set of IPIs missed it, so we need the rescue-tick for long non-rcu_read_lock() implicit critical sections.. 
> > 
> >> If CONFIG_PREEMPT_COUNT=n, the tick can only report idle and user
> >> as QS, but those are already reported explicitly on ct_kernel_exit() ->
> >> rcu_preempt_deferred_qs().
> > 
> > Oh hmm, because that function is a NOOP for PREEMPT_COUNT=y and PREEMPT=n and will not report the deferred QS?  Maybe it should then. However I think the tick is still useful if after the preempt disabled section, will still did not exit the kernel.
> 
> I think meant I here, an atomic section (like bh or Irq disabled). There is no such thing as disabling preemption for CONFIG_PREEMPT=n. Or maybe I am confused again.  This RCU thing…

Right, so when CONFIG_PREEMPT_COUNT=n, there is no way for a tick to tell whether
the interrupted code can safely be considered a QS. That's because
preempt_disable() <-> preempt_enable() are no-ops, so the whole kernel is
assumed non-preemptible, and therefore the whole kernel is a read-side critical
section, except for the explicit points reporting a QS.

The only exception is when the tick interrupts idle (or user with
nohz_full). But we already have an exp QS reported on idle (and user with
nohz_full) entry through ct_kernel_exit(), and that happens on all RCU_TREE
configs (PREEMPT or not). Therefore the tick doesn't appear to be helpful at
all on a nohz_full CPU with CONFIG_PREEMPT_COUNT=n.

I suggest we don't bother optimizing that case though...

To summarize:

1) nohz_full && !CONFIG_PREEMPT_COUNT && !CONFIG_PREEMPT_RCU:
  Tick isn't helpful. It can only report idle/user QS, but that is
  already reported explicitly.

2) nohz_full && CONFIG_PREEMPT_COUNT && !CONFIG_PREEMPT_RCU:
  Tick is very helpful because it can tell if the kernel is in
  a QS state.

3) nohz_full && CONFIG_PREEMPT_RCU:
   Tick doesn't appear to be helpful because:
       - If the rcu_exp_handler() fires in an rcu_read_lock'ed section, then the
         exp QS is reported on rcu_read_unlock()
       - If the rcu_exp_handler() fires in a preempt/bh disabled section,
         TIF_RESCHED is forced which is handled on preempt/bh re-enablement,
	 reporting a QS.
   
  
Case 2) is a niche, only useful for debugging. But anyway I'm not sure it's
worth changing/optimizing the current state. Might be worth adding a comment
though.

Thanks.
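
For case 2) above, the tick-side check works roughly as follows (a sketch in
the spirit of the CONFIG_PREEMPT_RCU=n scheduling-clock-interrupt path; the
helper name is hypothetical and this is not the literal tree_plugin.h code):

static void rcu_tick_qs_sketch(int user)
{
	/*
	 * With CONFIG_PREEMPT_COUNT=y, each tick can inspect preempt_count()
	 * and report a quiescent state whenever the interrupted code was
	 * neither preempt- nor bh-disabled.
	 */
	if (user || rcu_is_cpu_rrupt_from_idle() ||
	    (IS_ENABLED(CONFIG_PREEMPT_COUNT) &&
	     !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))))
		rcu_qs();
	/*
	 * Without CONFIG_PREEMPT_COUNT the third test is unavailable: the
	 * tick can only recognize user/idle as a QS (case 1), and those are
	 * already reported explicitly on the context-tracking path.
	 */
}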
Paul E. McKenney Jan. 9, 2023, 7:28 p.m. UTC | #7
On Sat, Jan 07, 2023 at 10:39:22PM +0100, Frederic Weisbecker wrote:
> On Wed, Dec 21, 2022 at 12:16:34PM -0800, Paul E. McKenney wrote:
> > On Tue, Dec 20, 2022 at 10:16:25PM +0800, Zqiang wrote:
> > > This commit add TICK_DEP_MASK_RCU_EXP dependency check in
> > > check_tick_dependency(), fix commit df1e849ae4559 ("rcu: Enable
> > > tick for nohz_full CPUs slow to provide expedited QS").
> > > 
> > > Signed-off-by: Zqiang <qiang1.zhang@intel.com>
> > 
> > Again, good eyes, thank you!!!
> > 
> > I have queued this for further review and testing, given that it affects
> > pretty much only RCU.  However, if someone else would rather take it:
> > 
> > Acked-by: Paul E. McKenney <paulmck@kernel.org>
> 
> Thanks for picking it up! Please also add the following tags:
> 
> Fixes: df1e849ae455 ("rcu: Enable tick for nohz_full CPUs slow to provide expedited QS")
> Acked-by: Frederic Weisbecker <frederic@kernel.org>

Thank you for looking it over!  I will apply these on my next rebase.

							Thanx, Paul
Paul E. McKenney Jan. 9, 2023, 7:32 p.m. UTC | #8
On Mon, Jan 09, 2023 at 12:09:48AM +0100, Frederic Weisbecker wrote:
> On Sat, Jan 07, 2023 at 09:55:22PM -0500, Joel Fernandes wrote:
> > 
> > 
> > > On Jan 7, 2023, at 9:48 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> > > 
> > > 
> > >>> On Jan 7, 2023, at 5:11 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> > >>> 
> > >>> On Fri, Jan 06, 2023 at 07:01:28PM -0500, Joel Fernandes wrote:
> > >>> (lost html content)
> > > 
> > > My problem is the iPhone wises up when I put a web link in an email. I want to look into smtp relays but then if I spent time on fixing that, I might not get time to learn from emails like these... 
> > > 
> > >> I can't find a place where the exp grace period sends an IPI to
> > >> CPUs slow to report a QS. But anyway you really need the tick to poll
> > >> periodically on the CPU to chase a quiescent state.
> > > 
> > > Ok.
> > > 
> > >> Now arguably it's probably only useful when CONFIG_PREEMPT_COUNT=y
> > >> and rcu_exp_handler() has interrupted a preempt-disabled or bh-disabled
> > >> section. Although rcu_exp_handler() sets TIF_RESCHED, which is handled
> > >> by preempt_enable() and local_bh_enable() when CONFIG_PREEMPT=y.
> > >> So probably it's only useful when CONFIG_PREEMPT_COUNT=y and CONFIG_PREEMPT=n
> > >> (and there is also PREEMPT_DYNAMIC to consider).
> > > 
> > > Makes sense. I think I was missing this use case and was going by the general design of exp grace periods.  I was incorrectly assuming the IPIs were being sent repeatedly for hold out CPUs, which is not the case I think. But that would another way to fix it?
> > > 
> > > But yeah I get your point, the first set of IPIs missed it, so we need the rescue-tick for long non-rcu_read_lock() implicit critical sections.. 
> > > 
> > >> If CONFIG_PREEMPT_COUNT=n, the tick can only report idle and user
> > >> as QS, but those are already reported explicitly on ct_kernel_exit() ->
> > >> rcu_preempt_deferred_qs().
> > > 
> > > Oh hmm, because that function is a NOOP for PREEMPT_COUNT=y and PREEMPT=n and will not report the deferred QS?  Maybe it should then. However I think the tick is still useful if after the preempt disabled section, will still did not exit the kernel.
> > 
> > I think meant I here, an atomic section (like bh or Irq disabled). There is no such thing as disabling preemption for CONFIG_PREEMPT=n. Or maybe I am confused again.  This RCU thing…
> 
> Right, so when CONFIG_PREEMPT_COUNT=n, there is no way for a tick to tell if the
> the interrupted code is safely considered as a QS. That's because
> preempt_disable() <-> preempt_enable() are no-ops so the whole kernel is
> assumed non-preemptible, and therefore the whole kernel is a READ side critical
> section, except for the explicit points reporting a QS.
> 
> The only exception is when the tick interrupts idle (or user with
> nohz_full). But we already have an exp QS reported on idle (and user with
> nohz_full) entry through ct_kernel_exit(), and that happens on all RCU_TREE
> configs (PREEMPT or not). Therefore the tick doesn't appear to be helpful at
> all on a nohz_full CPU with CONFIG_PREEMPT_COUNT=n.
> 
> I suggest we don't bother optimizing that case though...
> 
> To summarize:
> 
> 1) nohz_full && !CONFIG_PREEMPT_COUNT && !CONFIG_PREEMPT_RCU:
>   Tick isn't helpful. It can only report idle/user QS, but that is
>   already reported explicitly.
> 
> 2) nohz_full && CONFIG_PREEMPT_COUNT && !CONFIG_PREEMPT_RCU:
>   Tick is very helpful because it can tell if the kernel is in
>   a QS state.
> 
> 3) nohz_full && CONFIG_PREEMPT_RCU:
>    Tick doesn't appear to be helpful because:
>        - If the rcu_exp_handler() fires in an rcu_read_lock'ed section, then the
>          exp QS is reported on rcu_read_unlock()
>        - If the rcu_exp_handler() fires in a preempt/bh disabled section,
>          TIF_RESCHED is forced which is handled on preempt/bh re-enablement,
> 	 reporting a QS.
>    
>   
> The case 2) is a niche, only useful for debugging. But anyway I'm not sure it's
> worth changing/optimizing the current state. Might be worth add a comment
> though.

Thank you both for the analysis!  I would welcome a comment.

One could argue that we should increase the delay before turning the
tick on, but my experience is that expedited grace periods almost always
complete in less than a jiffy, so there would almost never be any benefit
in doing so.  But if some large NO_HZ_FULL system with long RCU readers
starts having trouble with too-frequent tick enablement, that is one
possible fix.

							Thanx, Paul
Frederic Weisbecker Jan. 9, 2023, 11:55 p.m. UTC | #9
On Mon, Jan 09, 2023 at 11:32:26AM -0800, Paul E. McKenney wrote:
> On Mon, Jan 09, 2023 at 12:09:48AM +0100, Frederic Weisbecker wrote:
> > On Sat, Jan 07, 2023 at 09:55:22PM -0500, Joel Fernandes wrote:
> > > 
> > > 
> > > > On Jan 7, 2023, at 9:48 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > 
> > > > 
> > > >>> On Jan 7, 2023, at 5:11 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> > > >>> 
> > > >>> On Fri, Jan 06, 2023 at 07:01:28PM -0500, Joel Fernandes wrote:
> > > >>> (lost html content)
> > > > 
> > > > My problem is the iPhone wises up when I put a web link in an email. I want to look into smtp relays but then if I spent time on fixing that, I might not get time to learn from emails like these... 
> > > > 
> > > >> I can't find a place where the exp grace period sends an IPI to
> > > >> CPUs slow to report a QS. But anyway you really need the tick to poll
> > > >> periodically on the CPU to chase a quiescent state.
> > > > 
> > > > Ok.
> > > > 
> > > >> Now arguably it's probably only useful when CONFIG_PREEMPT_COUNT=y
> > > >> and rcu_exp_handler() has interrupted a preempt-disabled or bh-disabled
> > > >> section. Although rcu_exp_handler() sets TIF_RESCHED, which is handled
> > > >> by preempt_enable() and local_bh_enable() when CONFIG_PREEMPT=y.
> > > >> So probably it's only useful when CONFIG_PREEMPT_COUNT=y and CONFIG_PREEMPT=n
> > > >> (and there is also PREEMPT_DYNAMIC to consider).
> > > > 
> > > > Makes sense. I think I was missing this use case and was going by the general design of exp grace periods.  I was incorrectly assuming the IPIs were being sent repeatedly for hold out CPUs, which is not the case I think. But that would another way to fix it?
> > > > 
> > > > But yeah I get your point, the first set of IPIs missed it, so we need the rescue-tick for long non-rcu_read_lock() implicit critical sections.. 
> > > > 
> > > >> If CONFIG_PREEMPT_COUNT=n, the tick can only report idle and user
> > > >> as QS, but those are already reported explicitly on ct_kernel_exit() ->
> > > >> rcu_preempt_deferred_qs().
> > > > 
> > > > Oh hmm, because that function is a NOOP for PREEMPT_COUNT=y and PREEMPT=n and will not report the deferred QS?  Maybe it should then. However I think the tick is still useful if after the preempt disabled section, will still did not exit the kernel.
> > > 
> > > I think meant I here, an atomic section (like bh or Irq disabled). There is no such thing as disabling preemption for CONFIG_PREEMPT=n. Or maybe I am confused again.  This RCU thing…
> > 
> > Right, so when CONFIG_PREEMPT_COUNT=n, there is no way for a tick to tell if the
> > the interrupted code is safely considered as a QS. That's because
> > preempt_disable() <-> preempt_enable() are no-ops so the whole kernel is
> > assumed non-preemptible, and therefore the whole kernel is a READ side critical
> > section, except for the explicit points reporting a QS.
> > 
> > The only exception is when the tick interrupts idle (or user with
> > nohz_full). But we already have an exp QS reported on idle (and user with
> > nohz_full) entry through ct_kernel_exit(), and that happens on all RCU_TREE
> > configs (PREEMPT or not). Therefore the tick doesn't appear to be helpful at
> > all on a nohz_full CPU with CONFIG_PREEMPT_COUNT=n.
> > 
> > I suggest we don't bother optimizing that case though...
> > 
> > To summarize:
> > 
> > 1) nohz_full && !CONFIG_PREEMPT_COUNT && !CONFIG_PREEMPT_RCU:
> >   Tick isn't helpful. It can only report idle/user QS, but that is
> >   already reported explicitly.
> > 
> > 2) nohz_full && CONFIG_PREEMPT_COUNT && !CONFIG_PREEMPT_RCU:
> >   Tick is very helpful because it can tell if the kernel is in
> >   a QS state.
> > 
> > 3) nohz_full && CONFIG_PREEMPT_RCU:
> >    Tick doesn't appear to be helpful because:
> >        - If the rcu_exp_handler() fires in an rcu_read_lock'ed section, then the
> >          exp QS is reported on rcu_read_unlock()
> >        - If the rcu_exp_handler() fires in a preempt/bh disabled section,
> >          TIF_RESCHED is forced which is handled on preempt/bh re-enablement,
> > 	 reporting a QS.
> >    
> >   
> > The case 2) is a niche, only useful for debugging. But anyway I'm not sure it's
> > worth changing/optimizing the current state. Might be worth add a comment
> > though.
> 
> Thank you both for the analysis!  I would welcome a comment.

I'm preparing that.

> One could argue that we should increase the delay before turning the
> tick on, but my experience is that expedited grace periods almost always
> complete in less than a jiffy, so there would almost never be any benefit
> in doing so.  But if some large NO_HZ_FULL system with long RCU readers
> starts having trouble with too-frequent tick enablement, that is one
> possible fix.

And last but not least: wait for anybody to complain before changing anything
;-))

Thanks.
Paul E. McKenney Jan. 13, 2023, 7:25 p.m. UTC | #10
On Tue, Jan 10, 2023 at 12:55:51AM +0100, Frederic Weisbecker wrote:
> On Mon, Jan 09, 2023 at 11:32:26AM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 09, 2023 at 12:09:48AM +0100, Frederic Weisbecker wrote:
> > > On Sat, Jan 07, 2023 at 09:55:22PM -0500, Joel Fernandes wrote:
> > > > > On Jan 7, 2023, at 9:48 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > 
> > > > > 
> > > > >>> On Jan 7, 2023, at 5:11 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > >>> 
> > > > >>> On Fri, Jan 06, 2023 at 07:01:28PM -0500, Joel Fernandes wrote:
> > > > >>> (lost html content)
> > > > > 
> > > > > My problem is the iPhone wises up when I put a web link in an email. I want to look into smtp relays but then if I spent time on fixing that, I might not get time to learn from emails like these... 
> > > > > 
> > > > >> I can't find a place where the exp grace period sends an IPI to
> > > > >> CPUs slow to report a QS. But anyway you really need the tick to poll
> > > > >> periodically on the CPU to chase a quiescent state.
> > > > > 
> > > > > Ok.
> > > > > 
> > > > >> Now arguably it's probably only useful when CONFIG_PREEMPT_COUNT=y
> > > > >> and rcu_exp_handler() has interrupted a preempt-disabled or bh-disabled
> > > > >> section. Although rcu_exp_handler() sets TIF_RESCHED, which is handled
> > > > >> by preempt_enable() and local_bh_enable() when CONFIG_PREEMPT=y.
> > > > >> So probably it's only useful when CONFIG_PREEMPT_COUNT=y and CONFIG_PREEMPT=n
> > > > >> (and there is also PREEMPT_DYNAMIC to consider).
> > > > > 
> > > > > Makes sense. I think I was missing this use case and was going by the general design of exp grace periods.  I was incorrectly assuming the IPIs were being sent repeatedly for hold out CPUs, which is not the case I think. But that would another way to fix it?
> > > > > 
> > > > > But yeah I get your point, the first set of IPIs missed it, so we need the rescue-tick for long non-rcu_read_lock() implicit critical sections.. 
> > > > > 
> > > > >> If CONFIG_PREEMPT_COUNT=n, the tick can only report idle and user
> > > > >> as QS, but those are already reported explicitly on ct_kernel_exit() ->
> > > > >> rcu_preempt_deferred_qs().
> > > > > 
> > > > > Oh hmm, because that function is a NOOP for PREEMPT_COUNT=y and PREEMPT=n and will not report the deferred QS?  Maybe it should then. However I think the tick is still useful if after the preempt disabled section, will still did not exit the kernel.
> > > > 
> > > > I think meant I here, an atomic section (like bh or Irq disabled). There is no such thing as disabling preemption for CONFIG_PREEMPT=n. Or maybe I am confused again.  This RCU thing…
> > > 
> > > Right, so when CONFIG_PREEMPT_COUNT=n, there is no way for a tick to tell if the
> > > the interrupted code is safely considered as a QS. That's because
> > > preempt_disable() <-> preempt_enable() are no-ops so the whole kernel is
> > > assumed non-preemptible, and therefore the whole kernel is a READ side critical
> > > section, except for the explicit points reporting a QS.
> > > 
> > > The only exception is when the tick interrupts idle (or user with
> > > nohz_full). But we already have an exp QS reported on idle (and user with
> > > nohz_full) entry through ct_kernel_exit(), and that happens on all RCU_TREE
> > > configs (PREEMPT or not). Therefore the tick doesn't appear to be helpful at
> > > all on a nohz_full CPU with CONFIG_PREEMPT_COUNT=n.
> > > 
> > > I suggest we don't bother optimizing that case though...
> > > 
> > > To summarize:
> > > 
> > > 1) nohz_full && !CONFIG_PREEMPT_COUNT && !CONFIG_PREEMPT_RCU:
> > >   Tick isn't helpful. It can only report idle/user QS, but that is
> > >   already reported explicitly.
> > > 
> > > 2) nohz_full && CONFIG_PREEMPT_COUNT && !CONFIG_PREEMPT_RCU:
> > >   Tick is very helpful because it can tell if the kernel is in
> > >   a QS state.
> > > 
> > > 3) nohz_full && CONFIG_PREEMPT_RCU:
> > >    Tick doesn't appear to be helpful because:
> > >        - If the rcu_exp_handler() fires in an rcu_read_lock'ed section, then the
> > >          exp QS is reported on rcu_read_unlock()
> > >        - If the rcu_exp_handler() fires in a preempt/bh disabled section,
> > >          TIF_RESCHED is forced which is handled on preempt/bh re-enablement,
> > > 	 reporting a QS.
> > >    
> > >   
> > > The case 2) is a niche, only useful for debugging. But anyway I'm not sure it's
> > > worth changing/optimizing the current state. Might be worth add a comment
> > > though.
> > 
> > Thank you both for the analysis!  I would welcome a comment.
> 
> I'm preparing that.
> 
> > One could argue that we should increase the delay before turning the
> > tick on, but my experience is that expedited grace periods almost always
> > complete in less than a jiffy, so there would almost never be any benefit
> > in doing so.  But if some large NO_HZ_FULL system with long RCU readers
> > starts having trouble with too-frequent tick enablement, that is one
> > possible fix.
> 
> And last but not least: wait for anybody to complain before changing anything
> ;-))

Well said!  Up to a point, anyway.  ;-)

							Thanx, Paul

Patch

diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index 2e713a7d9aa3..3e8619c72f77 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -371,7 +371,8 @@  TRACE_EVENT(itimer_expire,
 		tick_dep_name(PERF_EVENTS)		\
 		tick_dep_name(SCHED)			\
 		tick_dep_name(CLOCK_UNSTABLE)		\
-		tick_dep_name_end(RCU)
+		tick_dep_name(RCU)			\
+		tick_dep_name_end(RCU_EXP)
 
 #undef tick_dep_name
 #undef tick_dep_mask_name
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b0e3c9205946..ba2ac1469d47 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -281,6 +281,11 @@  static bool check_tick_dependency(atomic_t *dep)
 		return true;
 	}
 
+	if (val & TICK_DEP_MASK_RCU_EXP) {
+		trace_tick_stop(0, TICK_DEP_MASK_RCU_EXP);
+		return true;
+	}
+
 	return false;
 }