From patchwork Sun Sep 23 01:42:10 2012 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Paul Walmsley X-Patchwork-Id: 1494691 Return-Path: X-Original-To: patchwork-linux-arm@patchwork.kernel.org Delivered-To: patchwork-process-083081@patchwork1.kernel.org Received: from merlin.infradead.org (merlin.infradead.org [205.233.59.134]) by patchwork1.kernel.org (Postfix) with ESMTP id D347A3FD40 for ; Sun, 23 Sep 2012 01:44:51 +0000 (UTC) Received: from localhost ([::1] helo=merlin.infradead.org) by merlin.infradead.org with esmtp (Exim 4.76 #1 (Red Hat Linux)) id 1TFbD8-0000kW-7n; Sun, 23 Sep 2012 01:42:18 +0000 Received: from utopia.booyaka.com ([74.50.51.50]) by merlin.infradead.org with esmtps (Exim 4.76 #1 (Red Hat Linux)) id 1TFbD4-0000kI-J1 for linux-arm-kernel@lists.infradead.org; Sun, 23 Sep 2012 01:42:15 +0000 Received: (qmail 28986 invoked by uid 1019); 23 Sep 2012 01:42:10 -0000 Date: Sun, 23 Sep 2012 01:42:10 +0000 (UTC) From: Paul Walmsley To: "Paul E. McKenney" Subject: Re: rcu self-detected stall messages on OMAP3, 4 boards In-Reply-To: Message-ID: References: <20120920220130.GN2449@linux.vnet.ibm.com> <20120920232114.GO2449@linux.vnet.ibm.com> <20120921185827.GC2454@linux.vnet.ibm.com> <20120921195717.GD2454@linux.vnet.ibm.com> <20120922201043.GE2934@linux.vnet.ibm.com> <20120922215923.GA13161@linux.vnet.ibm.com> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 X-Spam-Note: CRM114 invocation failed X-Spam-Score: -1.9 (-) X-Spam-Report: SpamAssassin version 3.3.2 on merlin.infradead.org summary: Content analysis details: (-1.9 points) pts rule name description ---- ---------------------- -------------------------------------------------- -0.0 SPF_HELO_PASS SPF: HELO matches SPF record -1.9 BAYES_00 BODY: Bayes spam probability is 0 to 1% [score: 0.0000] Cc: "Hilman, Kevin" , "" , fweisbec@gmail.com, "Bruce, Becky" , "" , "Paul E. McKenney" , "Shilimkar, Santosh" , "Hunter, Jon" , "" , "" X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: linux-arm-kernel-bounces@lists.infradead.org Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org Hi Paul On Sat, 22 Sep 2012, Paul Walmsley wrote: > On Sat, 22 Sep 2012, Paul E. McKenney wrote: > > > And here is a patch. I am still having trouble reproducing the problem, > > but figured that I should avoid serializing things. > > Thanks, testing this now on v3.6-rc6. Looks like you solved it! Tested v3.6-rc6 + your stall diagnostic patch: http://marc.info/?l=linux-arm-kernel&m=134827237215882&w=2 on OMAP4430ES2 Pandaboard using omap2plus_defconfig and CONFIG_RCU_CPU_STALL_INFO=y; got the stall warnings. Then added "rcu: Fix day-one dyntick-idle stall-warning bug" from: http://marc.info/?l=linux-arm-kernel&m=134835120600590&w=2 Booted that, and the stall warnings did not appear within 30 minutes. To confirm that the problem being solved matched your hypothesis, the debugging patch below[1] was added to the RCU idle entry/exit code. Without the bugfix patch, a boot log transcript was obtained indicating that the idle loop was entered with tick_nohz_enabled=1 during a grace period with no callbacks present: http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-pre-fix.txt The debugging events started to appear at 1.867370 seconds into the boot. ENTER was pressed about 464 seconds in; this triggered the rcu_sched stall traceback. With the bugfix patch, a boot log transcript was obtained that indicated that the condition under test never occurred after waiting about 20 minutes: http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-post-fix.txt Thanks for being so willing to root-cause the issue, Paul; it's appreciated, and it's been quite instructive as well. Will address some remaining loose ends in follow-up E-mails. - Paul [1] Debugging patch to printk() if the previous idle loop entry occurred with tick_nohz_enabled=1 during a grace period with no RCU callbacks present: Tested-by: Paul Walmsley # OMAP4430 --- kernel/rcutree.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/kernel/rcutree.c b/kernel/rcutree.c index f1eb7ad..f42941b 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -60,6 +60,9 @@ /* Data structures. */ +extern int tick_nohz_enabled; +static int no_cbs_idle_entry_count; + static struct lock_class_key rcu_node_class[RCU_NUM_LVLS]; #define RCU_STATE_INITIALIZER(sname, cr) { \ @@ -400,8 +403,12 @@ void rcu_idle_enter(void) unsigned long flags; long long oldval; struct rcu_dynticks *rdtp; + int cpu; + long totqlen = 0; + struct rcu_data *rdp; local_irq_save(flags); + rdp = &__get_cpu_var(rcu_sched_data); rdtp = &__get_cpu_var(rcu_dynticks); oldval = rdtp->dynticks_nesting; WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0); @@ -410,6 +417,12 @@ void rcu_idle_enter(void) else rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE; rcu_idle_enter_common(rdtp, oldval); + if (tick_nohz_enabled && rcu_gp_in_progress(rdp->rsp)) { + for_each_possible_cpu(cpu) + totqlen += per_cpu_ptr(rdp->rsp->rda, cpu)->qlen; + if (totqlen == 0) + no_cbs_idle_entry_count = 1; + } local_irq_restore(flags); } EXPORT_SYMBOL_GPL(rcu_idle_enter); @@ -503,6 +516,10 @@ void rcu_idle_exit(void) rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE; rcu_idle_exit_common(rdtp, oldval); local_irq_restore(flags); + if (no_cbs_idle_entry_count) { + no_cbs_idle_entry_count = 0; + pr_err("* Tickless idle was entered with zero RCU callbacks\n"); + } } EXPORT_SYMBOL_GPL(rcu_idle_exit);