From patchwork Mon Sep 7 15:37:26 2009
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Roland Dreier
X-Patchwork-Id: 46124
From: Roland Dreier
To: Bart Van Assche, linux-rdma@vger.kernel.org, general@lists.openfabrics.org
Subject: [NEW PATCH] IB/mad: Fix possible lock-lock-timer deadlock
X-Message-Flag: Warning: May contain useful information
Date: Mon, 07 Sep 2009 08:37:26 -0700
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.91 (gnu/linux)
Sender: linux-rdma-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-rdma@vger.kernel.org

A new interface was added to the core workqueue API to make handling
cancel_delayed_work() deadlocks easier, so a simpler fix for bug 13757
as below becomes possible.  Bart, it would be great if you could
retest this, since it is what I am planning on sending upstream for
2.6.31.  (This patch depends on 4e49627b, "workqueues: introduce
__cancel_delayed_work()", which was merged for 2.6.31-rc9;
alternatively my for-next branch is now rebased on top of -rc9 and has
this patch plus everything else queued for 2.6.32).

Thanks,
  Roland


Lockdep reported a possible deadlock with cm_id_priv->lock,
mad_agent_priv->lock and mad_agent_priv->timed_work.timer; this
happens because the mad module does

	cancel_delayed_work(&mad_agent_priv->timed_work);

while holding mad_agent_priv->lock.  cancel_delayed_work() internally
does del_timer_sync(&mad_agent_priv->timed_work.timer).

This can turn into a deadlock because mad_agent_priv->lock is taken
inside cm_id_priv->lock, so we can get the following set of contexts
that deadlock each other:

 A: holding cm_id_priv->lock, waiting for mad_agent_priv->lock
 B: holding mad_agent_priv->lock, waiting for del_timer_sync()
 C: interrupt during mad_agent_priv->timed_work.timer that takes
    cm_id_priv->lock

Fix this by using the new __cancel_delayed_work() interface (which
internally does del_timer() instead of del_timer_sync()) in all the
places where we are holding a lock.

Addresses: http://bugzilla.kernel.org/show_bug.cgi?id=13757
Reported-by: Bart Van Assche
Signed-off-by: Roland Dreier
---
 drivers/infiniband/core/mad.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index de922a0..bc30c00 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -1974,7 +1974,7 @@ static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv)
 	unsigned long delay;
 
 	if (list_empty(&mad_agent_priv->wait_list)) {
-		cancel_delayed_work(&mad_agent_priv->timed_work);
+		__cancel_delayed_work(&mad_agent_priv->timed_work);
 	} else {
 		mad_send_wr = list_entry(mad_agent_priv->wait_list.next,
 					 struct ib_mad_send_wr_private,
@@ -1983,7 +1983,7 @@ static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv)
 		if (time_after(mad_agent_priv->timeout,
 			       mad_send_wr->timeout)) {
 			mad_agent_priv->timeout = mad_send_wr->timeout;
-			cancel_delayed_work(&mad_agent_priv->timed_work);
+			__cancel_delayed_work(&mad_agent_priv->timed_work);
 			delay = mad_send_wr->timeout - jiffies;
 			if ((long)delay <= 0)
 				delay = 1;
@@ -2023,7 +2023,7 @@ static void wait_for_response(struct ib_mad_send_wr_private *mad_send_wr)
 
 	/* Reschedule a work item if we have a shorter timeout */
 	if (mad_agent_priv->wait_list.next == &mad_send_wr->agent_list) {
-		cancel_delayed_work(&mad_agent_priv->timed_work);
+		__cancel_delayed_work(&mad_agent_priv->timed_work);
 		queue_delayed_work(mad_agent_priv->qp_info->port_priv->wq,
 				   &mad_agent_priv->timed_work, delay);
 	}
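
For illustration only (this is not part of the patch and is not kernel code), the A/B/C cycle from the changelog can be modelled as a small "waits-for" graph in userspace C. The names dep_graph, dep_add, dep_has_cycle and mad_cancel_can_deadlock are hypothetical helpers invented for this sketch; the only thing taken from the changelog is which context waits for what. With a synchronous cancel (the del_timer_sync() behaviour) the graph has a cycle; with the non-waiting cancel (the del_timer() behaviour used by __cancel_delayed_work()) edge B disappears and the cycle is broken.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Nodes of a toy waits-for graph; the names mirror the changelog. */
enum node { CM_LOCK, MAD_LOCK, TIMER, NR_NODES };

struct dep_graph {
	/* edge[a][b]: a context holding (or running in) a may wait for b */
	bool edge[NR_NODES][NR_NODES];
};

static void dep_init(struct dep_graph *g)
{
	memset(g, 0, sizeof(*g));
}

static void dep_add(struct dep_graph *g, enum node from, enum node to)
{
	assert(from < NR_NODES && to < NR_NODES);
	g->edge[from][to] = true;
}

/* Depth-first search: is target reachable from cur along edges? */
static bool reaches(const struct dep_graph *g, enum node cur,
		    enum node target, bool *seen)
{
	for (int next = 0; next < NR_NODES; next++) {
		if (!g->edge[cur][next])
			continue;
		if (next == (int)target)
			return true;
		if (!seen[next]) {
			seen[next] = true;
			if (reaches(g, next, target, seen))
				return true;
		}
	}
	return false;
}

/* A cycle exists if some node can reach itself. */
static bool dep_has_cycle(const struct dep_graph *g)
{
	for (int n = 0; n < NR_NODES; n++) {
		bool seen[NR_NODES] = { false };

		seen[n] = true;
		if (reaches(g, n, n, seen))
			return true;
	}
	return false;
}

/*
 * Build the graph from the changelog.  sync_cancel == true models
 * cancel_delayed_work() waiting for the timer handler under
 * mad_agent_priv->lock; false models the non-waiting cancel.
 */
static bool mad_cancel_can_deadlock(bool sync_cancel)
{
	struct dep_graph g;

	dep_init(&g);
	dep_add(&g, CM_LOCK, MAD_LOCK);		/* A */
	if (sync_cancel)
		dep_add(&g, MAD_LOCK, TIMER);	/* B */
	dep_add(&g, TIMER, CM_LOCK);		/* C */
	return dep_has_cycle(&g);
}
```

Calling mad_cancel_can_deadlock(true) reports the cycle that lockdep flagged, while mad_cancel_can_deadlock(false) reports none. The model only captures the ordering argument; it says nothing about whether a handler that is still running after a non-waiting cancel is safe, which the real code has to handle separately.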