diff mbox

rcu self-detected stall messages on OMAP3, 4 boards

Message ID alpine.DEB.2.00.1209230136320.28806@utopia.booyaka.com (mailing list archive)
State New, archived
Headers show

Commit Message

Paul Walmsley Sept. 23, 2012, 1:42 a.m. UTC
Hi Paul

On Sat, 22 Sep 2012, Paul Walmsley wrote:

> On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> 
> > And here is a patch.  I am still having trouble reproducing the problem,
> > but figured that I should avoid serializing things.
> 
> Thanks, testing this now on v3.6-rc6.

Looks like you solved it!

Tested v3.6-rc6 + your stall diagnostic patch:

    http://marc.info/?l=linux-arm-kernel&m=134827237215882&w=2

on OMAP4430ES2 Pandaboard using omap2plus_defconfig and 
CONFIG_RCU_CPU_STALL_INFO=y; got the stall warnings.

Then added "rcu: Fix day-one dyntick-idle stall-warning bug" from:

    http://marc.info/?l=linux-arm-kernel&m=134835120600590&w=2

Booted that, and the stall warnings did not appear within 30 minutes.

To confirm that the problem being solved matched your hypothesis, the
debugging patch below[1] was added to the RCU idle entry/exit code.

Without the bugfix patch, a boot log transcript was obtained
indicating that the idle loop was entered with tick_nohz_enabled=1
during a grace period with no callbacks present:

    http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-pre-fix.txt

The debugging events started to appear at 1.867370 seconds into the
boot.  ENTER was pressed about 464 seconds in; this triggered the
rcu_sched stall traceback.

With the bugfix patch, a boot log transcript was obtained that
indicated that the condition under test never occurred after waiting
about 20 minutes:

    http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-post-fix.txt

Thanks for being so willing to root-cause the issue, Paul; it's 
appreciated, and it's been quite instructive as well.  Will address some 
remaining loose ends in follow-up E-mails.


- Paul


[1] Debugging patch to printk() if the previous idle loop entry occurred 
with tick_nohz_enabled=1 during a grace period with no RCU callbacks 
present:


---
 kernel/rcutree.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

Comments

Paul E. McKenney Sept. 23, 2012, 1:56 a.m. UTC | #1
On Sun, Sep 23, 2012 at 01:42:10AM +0000, Paul Walmsley wrote:
> Hi Paul
> 
> On Sat, 22 Sep 2012, Paul Walmsley wrote:
> 
> > On Sat, 22 Sep 2012, Paul E. McKenney wrote:
> > 
> > > And here is a patch.  I am still having trouble reproducing the problem,
> > > but figured that I should avoid serializing things.
> > 
> > Thanks, testing this now on v3.6-rc6.
> 
> Looks like you solved it!
> 
> Tested v3.6-rc6 + your stall diagnostic patch:
> 
>     http://marc.info/?l=linux-arm-kernel&m=134827237215882&w=2
> 
> on OMAP4430ES2 Pandaboard using omap2plus_defconfig and 
> CONFIG_RCU_CPU_STALL_INFO=y; got the stall warnings.
> 
> Then added "rcu: Fix day-one dyntick-idle stall-warning bug" from:
> 
>     http://marc.info/?l=linux-arm-kernel&m=134835120600590&w=2
> 
> Booted that, and the stall warnings did not appear within 30 minutes.

Very cool, thank you for your testing efforts!!!

May I apply your Tested-by to this patch?

And good show on the debugging patch -- it is quite good to have such
solid evidence that the bug that the fix was intended for was actually
occurring.

						Thanx, Paul

> To confirm that the problem being solved matched your hypothesis, the
> debugging patch below[1] was added to the RCU idle entry/exit code.
> 
> Without the bugfix patch, a boot log transcript was obtained
> indicating that the idle loop was entered with tick_nohz_enabled=1
> during a grace period with no callbacks present:
> 
>     http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-pre-fix.txt
> 
> The debugging events started to appear at 1.867370 seconds into the
> boot.  ENTER was pressed about 464 seconds in; this triggered the
> rcu_sched stall traceback.
> 
> With the bugfix patch, a boot log transcript was obtained that
> indicated that the condition under test never occurred after waiting
> about 20 minutes:
> 
>     http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-post-fix.txt
> 
> Thanks for being so willing to root-cause the issue, Paul; it's 
> appreciated, and it's been quite instructive as well.  Will address some 
> remaining loose ends in follow-up E-mails.
> 
> 
> - Paul
> 
> 
> [1] Debugging patch to printk() if the previous idle loop entry occurred 
> with tick_nohz_enabled=1 during a grace period with no RCU callbacks 
> present:
> 
> 
> ---
>  kernel/rcutree.c |   17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index f1eb7ad..f42941b 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -60,6 +60,9 @@
> 
>  /* Data structures. */
> 
> +extern int tick_nohz_enabled;
> +static int no_cbs_idle_entry_count;
> +
>  static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
> 
>  #define RCU_STATE_INITIALIZER(sname, cr) { \
> @@ -400,8 +403,12 @@ void rcu_idle_enter(void)
>  	unsigned long flags;
>  	long long oldval;
>  	struct rcu_dynticks *rdtp;
> +	int cpu;
> +	long totqlen = 0;
> +	struct rcu_data *rdp;
> 
>  	local_irq_save(flags);
> +	rdp = &__get_cpu_var(rcu_sched_data);
>  	rdtp = &__get_cpu_var(rcu_dynticks);
>  	oldval = rdtp->dynticks_nesting;
>  	WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0);
> @@ -410,6 +417,12 @@ void rcu_idle_enter(void)
>  	else
>  		rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE;
>  	rcu_idle_enter_common(rdtp, oldval);
> +	if (tick_nohz_enabled && rcu_gp_in_progress(rdp->rsp)) {
> +		for_each_possible_cpu(cpu)
> +			totqlen += per_cpu_ptr(rdp->rsp->rda, cpu)->qlen;
> +		if (totqlen == 0)
> +			no_cbs_idle_entry_count = 1;
> +	}
>  	local_irq_restore(flags);
>  }
>  EXPORT_SYMBOL_GPL(rcu_idle_enter);
> @@ -503,6 +516,10 @@ void rcu_idle_exit(void)
>  		rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
>  	rcu_idle_exit_common(rdtp, oldval);
>  	local_irq_restore(flags);
> +	if (no_cbs_idle_entry_count) {
> +		no_cbs_idle_entry_count = 0;
> +		pr_err("* Tickless idle was entered with zero RCU callbacks\n");
> +	}
>  }
>  EXPORT_SYMBOL_GPL(rcu_idle_exit);
> 
> -- 
> 1.7.10.4
>
Paul Walmsley Sept. 23, 2012, 2:01 a.m. UTC | #2
On Sat, 22 Sep 2012, Paul E. McKenney wrote:

> Very cool, thank you for your testing efforts!!!

You're welcome.

> May I apply your Tested-by to this patch?

Please do:

Tested-by: Paul Walmsley <paul@pwsan.com>  # OMAP4430

Am testing on OMAP3730 (single-core) now.


- Paul
diff mbox

Patch

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index f1eb7ad..f42941b 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -60,6 +60,9 @@ 
 
 /* Data structures. */
 
+extern int tick_nohz_enabled;
+static int no_cbs_idle_entry_count;
+
 static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
 
 #define RCU_STATE_INITIALIZER(sname, cr) { \
@@ -400,8 +403,12 @@  void rcu_idle_enter(void)
 	unsigned long flags;
 	long long oldval;
 	struct rcu_dynticks *rdtp;
+	int cpu;
+	long totqlen = 0;
+	struct rcu_data *rdp;
 
 	local_irq_save(flags);
+	rdp = &__get_cpu_var(rcu_sched_data);
 	rdtp = &__get_cpu_var(rcu_dynticks);
 	oldval = rdtp->dynticks_nesting;
 	WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0);
@@ -410,6 +417,12 @@  void rcu_idle_enter(void)
 	else
 		rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE;
 	rcu_idle_enter_common(rdtp, oldval);
+	if (tick_nohz_enabled && rcu_gp_in_progress(rdp->rsp)) {
+		for_each_possible_cpu(cpu)
+			totqlen += per_cpu_ptr(rdp->rsp->rda, cpu)->qlen;
+		if (totqlen == 0)
+			no_cbs_idle_entry_count = 1;
+	}
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -503,6 +516,10 @@  void rcu_idle_exit(void)
 		rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
 	rcu_idle_exit_common(rdtp, oldval);
 	local_irq_restore(flags);
+	if (no_cbs_idle_entry_count) {
+		no_cbs_idle_entry_count = 0;
+		pr_err("* Tickless idle was entered with zero RCU callbacks\n");
+	}
 }
 EXPORT_SYMBOL_GPL(rcu_idle_exit);