From patchwork Fri Nov 18 22:51:18 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 9437505
Date: Fri, 18 Nov 2016 14:51:18 -0800
From: Andrew Morton
To: Gechangwei
Message-Id: <20161118145118.037001453d878399f0572a98@linux-foundation.org>
In-Reply-To: <63ADC13FD55D6546B7DECE290D39E373220C9A5B@H3CMLB12-EX.srv.huawei-3com.com>
References: <63ADC13FD55D6546B7DECE290D39E373220C9A5B@H3CMLB12-EX.srv.huawei-3com.com>
Cc: "'ocfs2-devel@oss.oracle.com' \(ocfs2-devel@oss.oracle.com\)"
Subject: Re: [Ocfs2-devel] [PATCH] MLE releases issue.

On Fri, 28 Oct 2016 07:14:20 +0000 Gechangwei wrote:

> Hi,
> During my test of OCFS2 under a storage failure, a crash was found.
> Below is the call trace from the crash.
> From the call trace, we can see that an MLE's reference count is about to
> go negative, which triggered a BUG_ON().
>
> [143355.593258] Call Trace:
> [143355.593268] [] dlm_put_mle_inuse+0x47/0x70 [ocfs2_dlm]
> [143355.593276] [] dlm_get_lock_resource+0xac5/0x10d0 [ocfs2_dlm]
> [143355.593286] [] ? ip_queue_xmit+0x14a/0x3d0
> [143355.593292] [] ? kmem_cache_alloc+0x1e4/0x220
> [143355.593300] [] ? dlm_wait_for_recovery+0x6c/0x190 [ocfs2_dlm]
> [143355.593311] [] dlmlock+0x62d/0x16e0 [ocfs2_dlm]
> [143355.593316] [] ? __alloc_skb+0x9b/0x2b0
> [143355.593323] [] ? 0xffffffffc01f6000
>
>
> I think I have probably found the root cause of this issue. Please see
> the scenario below.
>
> Node 1                           Node 2
>                                  Storage failure
>                                  An assert master message is sent to Node 1
> Treat Node2 as down
> Assert master handler
> Decrease MLE reference count
> Clean blocked MLE
> Decrease MLE reference count
>
>
> In the above scenario, both dlm_assert_master_handler and dlm_clean_block_mle
> will decrease the MLE reference count; thus, in the following get_resource
> procedure, the reference count is going to be negative.
>
> I propose a patch to solve this, please review it if you have time.
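The sequence in the report above can be replayed in a small userspace model. This is only a sketch, not the ocfs2 code: `toy_mle`, `toy_put` and `toy_replay_unpatched` are hypothetical names, and the starting count of two references (one to be dropped when mastery is resolved, one held by the waiter in the lock path) is an assumption made to keep the model small.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for a master list entry (MLE). */
struct toy_mle {
    int refs;   /* models the MLE reference count */
};

/* Drop one reference and log which path dropped it. */
static void toy_put(struct toy_mle *mle, const char *who)
{
    assert(mle != NULL);
    mle->refs--;
    printf("%-22s refs=%d\n", who, mle->refs);
}

/*
 * Replay the unpatched sequence seen on Node 1 and return the final count.
 * Assumption: the entry starts with two references.
 */
static int toy_replay_unpatched(void)
{
    struct toy_mle mle = { .refs = 2 };

    toy_put(&mle, "assert master handler");  /* extra_ref drop */
    toy_put(&mle, "clean blocked MLE");      /* dead-node cleanup drop */
    toy_put(&mle, "get lock resource");      /* waiter's own drop */
    return mle.refs;
}
```

Calling `toy_replay_unpatched()` from a `main()` shows the count passing through zero after the second drop, so the waiter's own put is the one that goes negative, which is the point where the real code hits its BUG_ON().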
I don't think I've seen any discussion of this patch?  I'll queue it up
for testing in the meanwhile.

> ---
>  dlm/dlmmaster.c | 8 +++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/dlm/dlmmaster.c b/dlm/dlmmaster.c
> index b747854..0540414 100644
> --- a/dlm/dlmmaster.c
> +++ b/dlm/dlmmaster.c
> @@ -2020,7 +2020,7 @@ ok:
>
>                  spin_lock(&mle->spinlock);
>                  if (mle->type == DLM_MLE_BLOCK || mle->type == DLM_MLE_MIGRATION)
> -                        extra_ref = 1;
> +                        extra_ref = test_bit(assert->node_idx, mle->maybe_map) ? 1 : 0;
>                  else {
>                          /* MASTER mle: if any bits set in the response map
>                           * then the calling node needs to re-assert to clear
> @@ -3465,12 +3465,18 @@ static void dlm_clean_block_mle(struct dlm_ctxt *dlm,
>                  mlog(0, "mle found, but dead node %u would not have been "
>                       "master\n", dead_node);
>                  spin_unlock(&mle->spinlock);
> +        } else if(mle->master != O2NM_MAX_NODES){
> +                mlog(ML_NOTICE, "mle found, master assert received, master has "
> +                     "already set to %d.\n ", mle->master);
> +                spin_unlock(&mle->spinlock);
>          } else {
>                  /* Must drop the refcount by one since the assert_master will
>                   * never arrive. This may result in the mle being unlinked and
>                   * freed, but there may still be a process waiting in the
>                   * dlmlock path which is fine. */
>                  mlog(0, "node %u was expected master\n", dead_node);
> +                clear_bit(bit, mle->maybe_map);
>                  atomic_set(&mle->woken, 1);
>                  spin_unlock(&mle->spinlock);
>                  wake_up(&mle->wq);

There are quite a lot of issues here.

- The patch headers should be in `patch -p1' form.  So with
  "a/fs/ocfs2/dlm/dlmmaster.c", not "a/dlm/dlmmaster.c".

- Your email client makes a big mess: strange character encoding,
  tabs replaced with spaces, etc.  Please figure out how to send
  text/plain emails.  Mail a patch to yourself, check that it can
  still be applied.

- There are a few coding style issues.
  Fixes:

--- a/fs/ocfs2/dlm/dlmmaster.c~mle-releases-issue-fix
+++ a/fs/ocfs2/dlm/dlmmaster.c
@@ -1935,7 +1935,7 @@ ok:
 
 		spin_lock(&mle->spinlock);
 		if (mle->type == DLM_MLE_BLOCK || mle->type == DLM_MLE_MIGRATION)
-			extra_ref = test_bit(assert->node_idx, mle->maybe_map) ? 1 : 0;
+			extra_ref = test_bit(assert->node_idx, mle->maybe_map);
 		else {
 			/* MASTER mle: if any bits set in the response map
 			 * then the calling node needs to re-assert to clear
@@ -3338,7 +3338,7 @@ static void dlm_clean_block_mle(struct d
 		mlog(0, "mle found, but dead node %u would not have been "
 				"master\n", dead_node);
 		spin_unlock(&mle->spinlock);
-	} else if(mle->master != O2NM_MAX_NODES){
+	} else if (mle->master != O2NM_MAX_NODES) {
 		mlog(ML_NOTICE, "mle found, master assert received, master has "
 				"already set to %d.\n ", mle->master);
 		spin_unlock(&mle->spinlock);
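
For readers following the logic rather than the style fixes, the guard that the patch adds can be sketched as a userspace model. Everything here is hypothetical (`toy_mle`, `TOY_NO_MASTER` standing in for O2NM_MAX_NODES, a single word as the maybe_map, a starting count of two references), and the dead-node check is simplified to a plain bit test; it is meant to show the intent of the change, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_NODES 32    /* hypothetical limit; fits in one unsigned long */
#define TOY_NO_MASTER 255   /* hypothetical stand-in for O2NM_MAX_NODES */

struct toy_mle {
    int refs;                 /* models the MLE reference count */
    unsigned long maybe_map;  /* one bit per node that might become master */
    unsigned int master;      /* TOY_NO_MASTER until an assert arrives */
};

static bool toy_test_bit(unsigned int nr, unsigned long map)
{
    return (map >> nr) & 1UL;
}

/* Assert-master path: drop the extra ref only if this node is still expected. */
static void toy_assert_master(struct toy_mle *mle, unsigned int node)
{
    assert(node < TOY_MAX_NODES);
    if (toy_test_bit(node, mle->maybe_map))
        mle->refs--;
    mle->master = node;
}

/* Dead-node path: skip if a master is already recorded, else clear and drop. */
static void toy_clean_block(struct toy_mle *mle, unsigned int dead_node)
{
    assert(dead_node < TOY_MAX_NODES);
    if (!toy_test_bit(dead_node, mle->maybe_map))
        return;                 /* dead node would not have been master */
    if (mle->master != TOY_NO_MASTER)
        return;                 /* assert already handled: no second drop */
    mle->maybe_map &= ~(1UL << dead_node);
    mle->refs--;
}

/* Run both paths in either order, then the waiter's put; return the count. */
static int toy_replay_patched(bool assert_first)
{
    struct toy_mle mle = {
        .refs = 2, .maybe_map = 1UL << 2, .master = TOY_NO_MASTER,
    };

    if (assert_first) {
        toy_assert_master(&mle, 2);
        toy_clean_block(&mle, 2);
    } else {
        toy_clean_block(&mle, 2);
        toy_assert_master(&mle, 2);
    }
    mle.refs--;                 /* the waiter's own put in the lock path */
    return mle.refs;
}
```

In this model, whichever path runs first takes the single extra drop and the other becomes a no-op, so `toy_replay_patched(true)` and `toy_replay_patched(false)` both end at zero instead of going negative.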