Message ID | alpine.DEB.2.00.1209230136320.28806@utopia.booyaka.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Sun, Sep 23, 2012 at 01:42:10AM +0000, Paul Walmsley wrote: > Hi Paul > > On Sat, 22 Sep 2012, Paul Walmsley wrote: > > > On Sat, 22 Sep 2012, Paul E. McKenney wrote: > > > > > And here is a patch. I am still having trouble reproducing the problem, > > > but figured that I should avoid serializing things. > > > > Thanks, testing this now on v3.6-rc6. > > Looks like you solved it! > > Tested v3.6-rc6 + your stall diagnostic patch: > > http://marc.info/?l=linux-arm-kernel&m=134827237215882&w=2 > > on OMAP4430ES2 Pandaboard using omap2plus_defconfig and > CONFIG_RCU_CPU_STALL_INFO=y; got the stall warnings. > > Then added "rcu: Fix day-one dyntick-idle stall-warning bug" from: > > http://marc.info/?l=linux-arm-kernel&m=134835120600590&w=2 > > Booted that, and the stall warnings did not appear within 30 minutes. Very cool, thank you for your testing efforts!!! May I apply your Tested-by to this patch? And good show on the debugging patch -- it is quite good to have such solid evidence that the bug that the fix was intended for was actually occurring. Thanx, Paul > To confirm that the problem being solved matched your hypothesis, the > debugging patch below[1] was added to the RCU idle entry/exit code. > > Without the bugfix patch, a boot log transcript was obtained > indicating that the idle loop was entered with tick_nohz_enabled=1 > during a grace period with no callbacks present: > > http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-pre-fix.txt > > The debugging events started to appear at 1.867370 seconds into the > boot. ENTER was pressed about 464 seconds in; this triggered the > rcu_sched stall traceback. > > With the bugfix patch, a boot log transcript was obtained that > indicated that the condition under test never occurred after waiting > about 20 minutes: > > http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-post-fix.txt > > Thanks for being so willing to root-cause the issue, Paul; it's > appreciated, and it's been quite instructive as well. Will address some > remaining loose ends in follow-up E-mails. > > > - Paul > > > [1] Debugging patch to printk() if the previous idle loop entry occurred > with tick_nohz_enabled=1 during a grace period with no RCU callbacks > present: > > > --- > kernel/rcutree.c | 17 +++++++++++++++++ > 1 file changed, 17 insertions(+) > > diff --git a/kernel/rcutree.c b/kernel/rcutree.c > index f1eb7ad..f42941b 100644 > --- a/kernel/rcutree.c > +++ b/kernel/rcutree.c > @@ -60,6 +60,9 @@ > > /* Data structures. */ > > +extern int tick_nohz_enabled; > +static int no_cbs_idle_entry_count; > + > static struct lock_class_key rcu_node_class[RCU_NUM_LVLS]; > > #define RCU_STATE_INITIALIZER(sname, cr) { \ > @@ -400,8 +403,12 @@ void rcu_idle_enter(void) > unsigned long flags; > long long oldval; > struct rcu_dynticks *rdtp; > + int cpu; > + long totqlen = 0; > + struct rcu_data *rdp; > > local_irq_save(flags); > + rdp = &__get_cpu_var(rcu_sched_data); > rdtp = &__get_cpu_var(rcu_dynticks); > oldval = rdtp->dynticks_nesting; > WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0); > @@ -410,6 +417,12 @@ void rcu_idle_enter(void) > else > rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE; > rcu_idle_enter_common(rdtp, oldval); > + if (tick_nohz_enabled && rcu_gp_in_progress(rdp->rsp)) { > + for_each_possible_cpu(cpu) > + totqlen += per_cpu_ptr(rdp->rsp->rda, cpu)->qlen; > + if (totqlen == 0) > + no_cbs_idle_entry_count = 1; > + } > local_irq_restore(flags); > } > EXPORT_SYMBOL_GPL(rcu_idle_enter); > @@ -503,6 +516,10 @@ void rcu_idle_exit(void) > rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE; > rcu_idle_exit_common(rdtp, oldval); > local_irq_restore(flags); > + if (no_cbs_idle_entry_count) { > + no_cbs_idle_entry_count = 0; > + pr_err("* Tickless idle was entered with zero RCU callbacks\n"); > + } > } > EXPORT_SYMBOL_GPL(rcu_idle_exit); > > -- > 1.7.10.4 >
On Sat, 22 Sep 2012, Paul E. McKenney wrote: > Very cool, thank you for your testing efforts!!! You're welcome. > May I apply your Tested-by to this patch? Please do: Tested-by: Paul Walmsley <paul@pwsan.com> # OMAP4430 Am testing on OMAP3730 (single-core) now. - Paul
diff --git a/kernel/rcutree.c b/kernel/rcutree.c index f1eb7ad..f42941b 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -60,6 +60,9 @@ /* Data structures. */ +extern int tick_nohz_enabled; +static int no_cbs_idle_entry_count; + static struct lock_class_key rcu_node_class[RCU_NUM_LVLS]; #define RCU_STATE_INITIALIZER(sname, cr) { \ @@ -400,8 +403,12 @@ void rcu_idle_enter(void) unsigned long flags; long long oldval; struct rcu_dynticks *rdtp; + int cpu; + long totqlen = 0; + struct rcu_data *rdp; local_irq_save(flags); + rdp = &__get_cpu_var(rcu_sched_data); rdtp = &__get_cpu_var(rcu_dynticks); oldval = rdtp->dynticks_nesting; WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0); @@ -410,6 +417,12 @@ void rcu_idle_enter(void) else rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE; rcu_idle_enter_common(rdtp, oldval); + if (tick_nohz_enabled && rcu_gp_in_progress(rdp->rsp)) { + for_each_possible_cpu(cpu) + totqlen += per_cpu_ptr(rdp->rsp->rda, cpu)->qlen; + if (totqlen == 0) + no_cbs_idle_entry_count = 1; + } local_irq_restore(flags); } EXPORT_SYMBOL_GPL(rcu_idle_enter); @@ -503,6 +516,10 @@ void rcu_idle_exit(void) rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE; rcu_idle_exit_common(rdtp, oldval); local_irq_restore(flags); + if (no_cbs_idle_entry_count) { + no_cbs_idle_entry_count = 0; + pr_err("* Tickless idle was entered with zero RCU callbacks\n"); + } } EXPORT_SYMBOL_GPL(rcu_idle_exit);