From patchwork Thu Dec 6 13:20:53 2018
X-Patchwork-Submitter: wangjian
X-Patchwork-Id: 10715987
From: wangjian
To: "mark@fasheh.com", "jiangqi903@gmail.com", jiangyiwen
Date: Thu, 6 Dec 2018 21:20:53 +0800
Cc: ocfs2-devel@oss.oracle.com
Subject: [Ocfs2-devel] [PATCH v2] ocfs2/dlm: Clean DLM_LKSB_GET_LVB and DLM_LKSB_PUT_LVB when the cancel_pending is set

We found a BUG in the dlm_proxy_ast_handler function that causes the machine to panic. The core information for this BUG is as follows.

[ 699.795843] kernel BUG at /usr/src/linux-4.18/fs/ocfs2/dlm/dlmast.c:427!
[ 699.801963] Workqueue: o2net o2net_rx_until_empty [ocfs2_nodemanager]
[ 699.803275] RIP: 0010:dlm_proxy_ast_handler+0x738/0x740 [ocfs2_dlm]
[ 699.808506] RSP: 0018:ffffba64c6f2fd38 EFLAGS: 00010246
[ 699.809456] RAX: ffff9f34a9b39148 RBX: ffff9f30b7af4000 RCX: ffff9f34a9b39148
[ 699.810698] RDX: 000000000000019e RSI: ffffffffc091a930 RDI: ffffba64c6f2fd80
[ 699.811927] RBP: ffff9f2cb7aa3000 R08: ffff9f2cb7b99400 R09: 000000000000001f
[ 699.813457] R10: ffff9f34a9249200 R11: ffff9f34af23aa00 R12: 0000000040000000
[ 699.814719] R13: ffff9f34a9249210 R14: 0000000000000002 R15: ffff9f34af23aa28
[ 699.815984] FS: 0000000000000000(0000) GS:ffff9f32b7c00000(0000) knlGS:0000000000000000
[ 699.817417] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 699.818825] CR2: 00007fd772f5a140 CR3: 000000005b00a001 CR4: 00000000001606e0
[ 699.820123] Call Trace:
[ 699.820658]  o2net_rx_until_empty+0x94b/0xcc0 [ocfs2_nodemanager]
[ 699.821848]  process_one_work+0x171/0x370
[ 699.822595]  worker_thread+0x49/0x3f0
[ 699.823301]  kthread+0xf8/0x130
[ 699.823972]  ? max_active_store+0x80/0x80
[ 699.824881]  ? kthread_bind+0x10/0x10
[ 699.825589]  ret_from_fork+0x35/0x40

Here is the situation. At the beginning, Node1 is the master of the lock resource; Node1 holds an NL lock, Node2 holds a PR lock, Node3 holds a PR lock, and Node4 holds an NL lock.

1. Node2 requests to convert lock_2 from PR to EX.
2. On Node1, the mode of lock_3 is PR, which blocks the conversion
   request of Node2, so Node1 moves lock_2 to the conversion list.
3. Node3 requests to convert lock_3 from PR to EX.
4. Node1 moves lock_3 to the conversion list and sends a BAST to Node3.
5. Node3 receives the BAST from Node1; its downconvert thread starts to
   cancel the convert operation.
6. Node1 dies because its host is powered down.
7. On Node3, the downconvert thread sets cancel_pending in the
   dlmunlock_common function. At the same time, Node3 realizes that
   Node1 is dead, so it moves lock_3 back to the granted list in the
   dlm_move_lockres_to_recovery_list function and removes Node1 from the
   domain_map in the __dlm_hb_node_down function. The downconvert thread
   then fails to send the lock cancellation request to Node1 and returns
   DLM_NORMAL from the dlm_send_remote_unlock_request function.
8. Node4 becomes the recovery master.
9. During recovery, Node2 sends lock_2, which is converting from PR to
   EX, to Node4.
10. During recovery, Node3 sends lock_3, which is on the granted list
    and still carries the DLM_LKSB_GET_LVB flag, to Node4. The
    downconvert thread then clears the DLM_LKSB_GET_LVB flag in the
    dlmunlock_common function.
11. Node4 finishes recovery. The mode of lock_3 is PR, which blocks the
    conversion request of Node2, so Node4 sends a BAST to Node3.
12. Node3 receives the BAST from Node4 and requests to convert lock_3
    from PR to NL.
13. Node4 changes the mode of lock_3 from PR to NL and sends a message
    to Node3.
14. Node3 receives the message from Node4. The message contains the
    LKM_GET_LVB flag, but lock->lksb->flags no longer contains
    DLM_LKSB_GET_LVB, so the BUG_ON in the dlm_proxy_ast_handler
    function fires.

The dlm_move_lockres_to_recovery_list function should clear DLM_LKSB_GET_LVB and DLM_LKSB_PUT_LVB when cancel_pending is set. The reasons for clearing these flags are as follows. First, the owner of the lock resource may have died and the lock has already been moved back to the granted queue, so the purpose of the lock cancellation has been achieved and the LVB flags should be cleared. Second, it fixes this panic.
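For reference, the assertion that fires is the LVB consistency check in dlm_proxy_ast_handler. A minimal sketch of that check, abridged from 4.18-era fs/ocfs2/dlm/dlmast.c (surrounding lookup and locking elided):

	/* dlm_proxy_ast_handler(), abridged: if the master's AST message
	 * says an LVB travels with it, the local lksb must still be
	 * flagged as expecting one */
	if (flags & LKM_GET_LVB) {
		BUG_ON(!(lock->lksb->flags & DLM_LKSB_GET_LVB));
		memcpy(lock->lksb->lvb, past->lvb, DLM_LVB_LEN);
	}

After step 10 above, Node3 has already dropped DLM_LKSB_GET_LVB from lock->lksb->flags, while Node4 still believes the LVB was requested, so the BUG_ON trips.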
Signed-off-by: Jian Wang
Reviewed-by: Yiwen Jiang
---
 fs/ocfs2/dlm/dlmunlock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ocfs2/dlm/dlmunlock.c b/fs/ocfs2/dlm/dlmunlock.c
index 63d701c..6e04fc7 100644
--- a/fs/ocfs2/dlm/dlmunlock.c
+++ b/fs/ocfs2/dlm/dlmunlock.c
@@ -277,6 +277,7 @@ void dlm_commit_pending_cancel(struct dlm_lock_resource *res,
 {
 	list_move_tail(&lock->list, &res->granted);
 	lock->ml.convert_type = LKM_IVMODE;
+	lock->lksb->flags &= ~(DLM_LKSB_GET_LVB|DLM_LKSB_PUT_LVB);
 }
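For context, a minimal sketch of the recovery path that reaches the patched function, abridged from dlm_move_lockres_to_recovery_list() in fs/ocfs2/dlm/dlmrecovery.c (the real code walks every lock queue; only the cancel_pending branch is shown):

	/* abridged: while moving a resource onto the recovery list, a
	 * cancel that was still in flight against the dead master is
	 * treated as if it had completed */
	list_for_each_entry_safe(lock, next, &res->converting, list) {
		if (lock->cancel_pending) {
			/* moves the lock back to the granted list and,
			 * with this patch, also clears the stale
			 * DLM_LKSB_GET_LVB/DLM_LKSB_PUT_LVB flags */
			dlm_commit_pending_cancel(res, lock);
			lock->cancel_pending = 0;
		}
	}

This is why clearing the LVB flags inside dlm_commit_pending_cancel covers exactly the cancel_pending case described above.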