From patchwork Fri Oct 28 07:14:20 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Changwei Ge
X-Patchwork-Id: 9401391
From: Gechangwei
To: "'ocfs2-devel@oss.oracle.com' (ocfs2-devel@oss.oracle.com)", "Andrew Morton (akpm@linux-foundation.org)"
Date: Fri, 28 Oct 2016 07:14:20 +0000
Message-ID: <63ADC13FD55D6546B7DECE290D39E373220C9A5B@H3CMLB12-EX.srv.huawei-3com.com>
Cc: "'ocfs2-devel@oss.oracle.com' (ocfs2-devel@oss.oracle.com)"
Subject: [Ocfs2-devel] [PATCH] MLE releases issue.

Hi,

While testing OCFS2 under a storage failure, I hit a crash. The call trace
at the time of the crash is below. It shows that an MLE's reference count
was about to go negative, which triggered a BUG_ON().

[143355.593258] Call Trace:
[143355.593268] dlm_put_mle_inuse+0x47/0x70 [ocfs2_dlm]
[143355.593276] dlm_get_lock_resource+0xac5/0x10d0 [ocfs2_dlm]
[143355.593286] ? ip_queue_xmit+0x14a/0x3d0
[143355.593292] ? kmem_cache_alloc+0x1e4/0x220
[143355.593300] ? dlm_wait_for_recovery+0x6c/0x190 [ocfs2_dlm]
[143355.593311] dlmlock+0x62d/0x16e0 [ocfs2_dlm]
[143355.593316] ? __alloc_skb+0x9b/0x2b0
[143355.593323] ? 0xffffffffc01f6000

I think I have probably found the root cause of this issue.
Please see the scenario below:

    Node 1                              Node 2
    ------                              ------
                                        Storage failure
                                        An assert master message is sent to Node 1
    Treat Node2 as down
    Assert master handler
      Decrease MLE reference count
    Clean blocked MLE
      Decrease MLE reference count

In the above scenario, both dlm_assert_master_handler() and
dlm_clean_block_mle() decrease the MLE reference count, so in the
subsequent dlm_get_lock_resource() path the reference count goes negative.

I propose the patch below to solve this. Please review it when you have time.

Signed-off-by: gechangwei
---
 dlm/dlmmaster.c | 8 +++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

BR.
Changwei

diff --git a/dlm/dlmmaster.c b/dlm/dlmmaster.c
index b747854..0540414 100644
--- a/dlm/dlmmaster.c
+++ b/dlm/dlmmaster.c
@@ -2020,7 +2020,7 @@ ok:
 		spin_lock(&mle->spinlock);
 		if (mle->type == DLM_MLE_BLOCK || mle->type == DLM_MLE_MIGRATION)
-			extra_ref = 1;
+			extra_ref = test_bit(assert->node_idx, mle->maybe_map) ? 1 : 0;
 		else {
 			/* MASTER mle: if any bits set in the response map
 			 * then the calling node needs to re-assert to clear
@@ -3465,12 +3465,18 @@ static void dlm_clean_block_mle(struct dlm_ctxt *dlm,
 		mlog(0, "mle found, but dead node %u would not have been "
 		     "master\n", dead_node);
 		spin_unlock(&mle->spinlock);
+	} else if (mle->master != O2NM_MAX_NODES) {
+		mlog(ML_NOTICE, "mle found, master assert received, master has "
+		     "already been set to %d.\n", mle->master);
+		spin_unlock(&mle->spinlock);
 	} else {
 		/* Must drop the refcount by one since the assert_master will
 		 * never arrive. This may result in the mle being unlinked and
 		 * freed, but there may still be a process waiting in the
 		 * dlmlock path which is fine. */
 		mlog(0, "node %u was expected master\n", dead_node);
+		clear_bit(bit, mle->maybe_map);
 		atomic_set(&mle->woken, 1);
 		spin_unlock(&mle->spinlock);
 		wake_up(&mle->wq);
--