From patchwork Thu Jul 31 13:13:31 2014
X-Patchwork-Submitter: Peter Zijlstra
X-Patchwork-Id: 4656191
Date: Thu, 31 Jul 2014 15:13:31 +0200
From: Peter Zijlstra
To: Ilya Dryomov
Cc: Linux Kernel Mailing List, Ingo Molnar, Ceph Development,
 davidlohr@hp.com, jason.low2@hp.com
Subject: Re: [PATCH] locking/mutexes: Revert "locking/mutexes: Add extra reschedule point"
Message-ID: <20140731131331.GT19379@twins.programming.kicks-ass.net>
References: <1406801797-20139-1-git-send-email-ilya.dryomov@inktank.com>
 <20140731115759.GS19379@twins.programming.kicks-ass.net>
List-ID: ceph-devel@vger.kernel.org

On Thu, Jul 31, 2014 at 04:37:29PM +0400, Ilya Dryomov wrote:
> This didn't make sense to me at first either, and I'll be happy to be
> proven wrong, but we can reproduce this with rbd very reliably under
> higher than usual load, and the revert makes it go away.  What we are
> seeing in the rbd scenario is the following.

This is drivers/block/rbd.c?  I can find but a single mutex_lock() in
there.

> Suppose foo needs mutexes A and B, and bar needs mutex B.  foo acquires
> A and then wants to acquire B, but B is held by bar.  foo spins
> a little and ends up calling schedule_preempt_disabled() on line 484
> above, but that call never returns, even though a hundred usecs later
> bar releases B.  foo ends up stuck in mutex_lock() indefinitely, still
> holding A, and everybody else who needs A queues up behind it.  Given
> that this A happens to be a central libceph mutex, all rbd activity
> halts.  Deadlock may not be the best term for this, but never returning
> from mutex_lock(&B) even though B has been unlocked is *a* problem.
>
> This obviously doesn't happen every time schedule_preempt_disabled() on
> line 484 is called, so there must be some sort of race here.  I'll send
> along the actual rbd stack traces shortly.
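The lock dependency in the quoted scenario can be restated as a minimal
userspace sketch with pthreads.  The names foo, bar, A and B mirror the
description above, and the usleep(100) mirrors the "hundred usecs";
this models only the dependency, not the kernel race itself -- with a
correct mutex implementation foo must make progress as soon as bar
drops B:

	#include <pthread.h>
	#include <stdio.h>
	#include <unistd.h>

	static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
	static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;

	static void *foo(void *arg)
	{
		pthread_mutex_lock(&A);	/* everybody else now queues on A */
		pthread_mutex_lock(&B);	/* blocks while bar holds B; in the
					 * reported bug this sometimes never
					 * returns even after bar unlocks B */
		pthread_mutex_unlock(&B);
		pthread_mutex_unlock(&A);
		return NULL;
	}

	static void *bar(void *arg)
	{
		pthread_mutex_lock(&B);
		usleep(100);			/* hold B briefly, as in the report */
		pthread_mutex_unlock(&B);	/* foo should acquire B here */
		return NULL;
	}

	int main(void)
	{
		pthread_t tf, tb;

		pthread_create(&tb, NULL, bar, NULL);
		usleep(10);			/* let bar take B first */
		pthread_create(&tf, NULL, foo, NULL);
		pthread_join(tf, NULL);
		pthread_join(tb, NULL);
		puts("foo made progress");	/* in the rbd case it did not */
		return 0;
	}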
Smells like maybe current->state != TASK_RUNNING, does the below
trigger?  If so, you've wrecked something in whatever...

---
 kernel/locking/mutex.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index ae712b25e492..3d726fdaa764 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -473,8 +473,12 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	 * reschedule now, before we try-lock the mutex. This avoids getting
 	 * scheduled out right after we obtained the mutex.
 	 */
-	if (need_resched())
+	if (need_resched()) {
+		if (WARN_ON_ONCE(current->state != TASK_RUNNING))
+			__set_current_state(TASK_RUNNING);
+
 		schedule_preempt_disabled();
+	}
 #endif
 	spin_lock_mutex(&lock->wait_lock, flags);
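For context on what the WARN_ON_ONCE() is checking: the canonical
kernel sleep pattern pairs a task-state change with a matching wakeup,
roughly as below (a generic sketch of that standard pattern, not code
from this patch; "condition" is a stand-in for whatever is waited on):

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (condition)		/* whatever we are waiting for */
			break;
		schedule();		/* safe only because a matching
					 * wake_up() makes us runnable again */
	}
	__set_current_state(TASK_RUNNING);

If a caller reaches the mutex slowpath with current->state still set to
a sleeping state, the schedule_preempt_disabled() in the hunk above
dequeues the task with no wakeup pending, which would match the
never-returns symptom; the WARN_ON_ONCE() makes such a leaked state
visible, and the __set_current_state(TASK_RUNNING) restores a sane
state for testing.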