From patchwork Tue May 24 18:13:12 2011
X-Patchwork-Submitter: Marc Zyngier
X-Patchwork-Id: 812842
Subject: [BUG] "sched: Remove rq->lock from the first half of ttwu()" locks up on ARM
From: Marc Zyngier
To: Peter Zijlstra
Cc: Frank Rowand, Ingo Molnar, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org
Organization: ARM Ltd
Date: Tue, 24 May 2011 19:13:12 +0100
Message-ID: <1306260792.27474.133.camel@e102391-lin.cambridge.arm.com>

Peter,

I've experienced all kinds of lock-ups on ARM SMP platforms recently,
and finally tracked them down to the following patch:

e4a52bcb9a18142d79e231b6733cabdbf2e67c1f
[sched: Remove rq->lock from the first half of ttwu()].

Even under moderate load, the machine locks up, often silently, and
sometimes with a few messages like:

INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 0} (detected by 1, t=12002 jiffies)

Another side effect of this patch is that the load average is always 0,
whatever load I throw at the system.

Reverting the sched changes up to and including that patch gives me a
working system again, which happily survives parallel kernel
compilations without complaining.

My knowledge of the scheduler being rather limited, I haven't been able
to pinpoint the exact problem (though it probably has something to do
with __ARCH_WANT_INTERRUPTS_ON_CTXSW being defined on ARM).

The enclosed patch somehow papers over the load average problem, but
the system ends up locking up anyway.

I'd be happy to test any patch you may have.

Cheers,

	M.

diff --git a/kernel/sched.c b/kernel/sched.c
index d3ade54..5ab43c4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2526,8 +2526,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		 * to spin on ->on_cpu if p is current, since that would
 		 * deadlock.
 		 */
-		if (p == current)
+		if (p == current) {
+			p->sched_contributes_to_load = !!task_contributes_to_load(p);
+			p->state = TASK_WAKING;
+			if (p->sched_class->task_waking)
+				p->sched_class->task_waking(p);
 			goto out_activate;
+		}
 #endif
 		cpu_relax();
 	}
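
For readers who want to see what the added branch does in isolation, here is a
minimal userspace sketch of the three statements the patch inserts before
`goto out_activate`. It is not kernel code: every `toy_` name is hypothetical,
and `toy_contributes_to_load()` is a simplified stand-in for
`task_contributes_to_load()`, assumed here to test only for an uninterruptible
sleep state. The sketch shows the ordering only (sample the load flag, then
overwrite the state, then call the optional hook); it does not model the
lock-up or the load average calculation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified task states; not the kernel's TASK_* values. */
enum toy_state { TOY_RUNNING, TOY_UNINTERRUPTIBLE, TOY_WAKING };

struct toy_task;

/* Models a scheduling class with an optional task_waking hook. */
struct toy_sched_class {
	void (*task_waking)(struct toy_task *p);	/* may be NULL */
};

struct toy_task {
	enum toy_state state;
	bool contributes_to_load;	/* models p->sched_contributes_to_load */
	int waking_calls;		/* counts hook invocations, for inspection */
	const struct toy_sched_class *sched_class;
};

/* Example hook: only records that it was called. */
static void toy_task_waking(struct toy_task *p)
{
	p->waking_calls++;
}

/* Assumption: a task counts towards load iff it sleeps uninterruptibly. */
static bool toy_contributes_to_load(const struct toy_task *p)
{
	return p->state == TOY_UNINTERRUPTIBLE;
}

/*
 * Mirrors the statements added in the "p == current" branch.
 * The flag is sampled before the state is overwritten; sampling it
 * afterwards would always yield false in this model.
 */
static void toy_prepare_self_wakeup(struct toy_task *p)
{
	p->contributes_to_load = toy_contributes_to_load(p);
	p->state = TOY_WAKING;
	if (p->sched_class->task_waking)
		p->sched_class->task_waking(p);
}
```

Calling `toy_prepare_self_wakeup()` on a task in `TOY_UNINTERRUPTIBLE` leaves it
in `TOY_WAKING` with `contributes_to_load` set and the hook called once. A class
whose `task_waking` pointer is NULL skips the hook, as in the patch.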