From patchwork Wed Aug 27 10:57:11 2014 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Joseph Qi X-Patchwork-Id: 4787521 Return-Path: X-Original-To: patchwork-ocfs2-devel@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.19.201]) by patchwork2.web.kernel.org (Postfix) with ESMTP id 87CEFC0338 for ; Wed, 27 Aug 2014 10:58:20 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id B1D0520115 for ; Wed, 27 Aug 2014 10:58:19 +0000 (UTC) Received: from userp1040.oracle.com (userp1040.oracle.com [156.151.31.81]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 7A963200D9 for ; Wed, 27 Aug 2014 10:58:18 +0000 (UTC) Received: from acsinet22.oracle.com (acsinet22.oracle.com [141.146.126.238]) by userp1040.oracle.com (Sentrion-MTA-4.3.2/Sentrion-MTA-4.3.2) with ESMTP id s7RAvxWf004953 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 27 Aug 2014 10:58:00 GMT Received: from oss.oracle.com (oss-external.oracle.com [137.254.96.51]) by acsinet22.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id s7RAvrLi021101 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 27 Aug 2014 10:57:54 GMT Received: from localhost ([127.0.0.1] helo=oss.oracle.com) by oss.oracle.com with esmtp (Exim 4.63) (envelope-from ) id 1XMavJ-0000mU-Jw; Wed, 27 Aug 2014 03:57:53 -0700 Received: from ucsinet21.oracle.com ([156.151.31.93]) by oss.oracle.com with esmtp (Exim 4.63) (envelope-from ) id 1XMav3-0000lo-Pq for ocfs2-devel@oss.oracle.com; Wed, 27 Aug 2014 03:57:38 -0700 Received: from userp1030.oracle.com (userp1030.oracle.com [156.151.31.80]) by ucsinet21.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id s7RAvaMB008793 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Wed, 27 Aug 2014 10:57:36 GMT Received: from userp2030.oracle.com (userp2030.oracle.com [156.151.31.89]) by userp1030.oracle.com (Sentrion-MTA-4.3.2/Sentrion-MTA-4.3.2) with ESMTP id s7RAvaPh016364 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO) for ; Wed, 27 Aug 2014 10:57:36 GMT Received: from pps.filterd (userp2030.oracle.com [127.0.0.1]) by userp2030.oracle.com (8.14.7/8.14.7) with SMTP id s7RAuxbw008578 for ; Wed, 27 Aug 2014 10:57:36 GMT Received: from szxga02-in.huawei.com (szxga02-in.huawei.com [119.145.14.65]) by userp2030.oracle.com with ESMTP id 1p1dv60440-1 (version=TLSv1/SSLv3 cipher=RC4-SHA bits=128 verify=NOT) for ; Wed, 27 Aug 2014 10:57:35 +0000 Received: from 172.24.2.119 (EHLO szxeml456-hub.china.huawei.com) ([172.24.2.119]) by szxrg02-dlp.huawei.com (MOS 4.3.7-GA FastPath queued) with ESMTP id BYR71425; Wed, 27 Aug 2014 18:57:26 +0800 (CST) Received: from [127.0.0.1] (10.177.24.125) by szxeml456-hub.china.huawei.com (10.82.67.199) with Microsoft SMTP Server id 14.3.158.1; Wed, 27 Aug 2014 18:57:19 +0800 Message-ID: <53FDB987.2010001@huawei.com> Date: Wed, 27 Aug 2014 18:57:11 +0800 From: Joseph Qi User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130801 Thunderbird/17.0.8 MIME-Version: 1.0 To: Andrew Morton X-Originating-IP: [10.177.24.125] X-CFilter-Loop: Reflected X-Proofpoint-Virus-Version: vendor=nai engine=5600 definitions=7542 signatures=670506 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 suspectscore=2 phishscore=0 adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=7.0.1-1402240000 definitions=main-1408270128 Cc: Mark Fasheh , "ocfs2-devel@oss.oracle.com" Subject: [Ocfs2-devel] [PATCH] ocfs2: fix deadlock between o2hb thread and o2net_wq X-BeenThere: ocfs2-devel@oss.oracle.com X-Mailman-Version: 2.1.9 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: ocfs2-devel-bounces@oss.oracle.com Errors-To: ocfs2-devel-bounces@oss.oracle.com X-Source-IP: acsinet22.oracle.com [141.146.126.238] X-Spam-Status: No, score=-4.2 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_MED, RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP The following case may lead to o2net_wq and o2hb thread deadlock on o2hb_callback_sem. Currently there are 2 nodes say N1, N2 in the cluster. And N2 down, at the same time, N3 tries to join the cluster. So N1 will handle node down (N2) and join (N3) simultaneously. o2hb o2net_wq ->o2hb_do_disk_heartbeat ->o2hb_check_slot ->o2hb_run_event_list ->o2hb_fire_callbacks ->down_write(&o2hb_callback_sem) ->o2net_hb_node_down_cb ->flush_workqueue(o2net_wq) ->o2net_process_message ->dlm_query_join_handler ->o2hb_check_node_heartbeating ->o2hb_fill_node_map ->down_read(&o2hb_callback_sem) No need to take o2hb_callback_sem in dlm_query_join_handler, o2hb_live_lock is enough to protect live node map. Signed-off-by: Joseph Qi --- fs/ocfs2/cluster/heartbeat.c | 19 +++++++++++++++++++ fs/ocfs2/cluster/heartbeat.h | 1 + fs/ocfs2/dlm/dlmdomain.c | 2 +- 3 files changed, 21 insertions(+), 1 deletion(-) diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c index 7303929..d133854 100644 --- a/fs/ocfs2/cluster/heartbeat.c +++ b/fs/ocfs2/cluster/heartbeat.c @@ -2572,6 +2572,25 @@ int o2hb_check_node_heartbeating(u8 node_num) } EXPORT_SYMBOL_GPL(o2hb_check_node_heartbeating); +int o2hb_check_node_heartbeating_no_sem(u8 node_num) +{ + unsigned long testing_map[BITS_TO_LONGS(O2NM_MAX_NODES)]; + unsigned long flags; + + spin_lock_irqsave(&o2hb_live_lock, flags); + o2hb_fill_node_map_from_callback(testing_map, sizeof(testing_map)); + spin_unlock_irqrestore(&o2hb_live_lock, flags); + if (!test_bit(node_num, testing_map)) { + mlog(ML_HEARTBEAT, + "node (%u) does not have heartbeating enabled.\n", + node_num); + return 0; + } + + return 1; +} +EXPORT_SYMBOL_GPL(o2hb_check_node_heartbeating_no_sem); + int o2hb_check_node_heartbeating_from_callback(u8 node_num) { unsigned long testing_map[BITS_TO_LONGS(O2NM_MAX_NODES)]; diff --git a/fs/ocfs2/cluster/heartbeat.h b/fs/ocfs2/cluster/heartbeat.h index 00ad8e8..3ef5137 100644 --- a/fs/ocfs2/cluster/heartbeat.h +++ b/fs/ocfs2/cluster/heartbeat.h @@ -80,6 +80,7 @@ void o2hb_fill_node_map(unsigned long *map, void o2hb_exit(void); int o2hb_init(void); int o2hb_check_node_heartbeating(u8 node_num); +int o2hb_check_node_heartbeating_no_sem(u8 node_num); int o2hb_check_node_heartbeating_from_callback(u8 node_num); int o2hb_check_local_node_heartbeating(void); void o2hb_stop_all_regions(void); diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c index 39efc50..bf86ed4 100644 --- a/fs/ocfs2/dlm/dlmdomain.c +++ b/fs/ocfs2/dlm/dlmdomain.c @@ -839,7 +839,7 @@ static int dlm_query_join_handler(struct o2net_msg *msg, u32 len, void *data, * to back off and try again. This gives heartbeat a chance * to catch up. */ - if (!o2hb_check_node_heartbeating(query->node_idx)) { + if (!o2hb_check_node_heartbeating_no_sem(query->node_idx)) { mlog(0, "node %u is not in our live map yet\n", query->node_idx);