From patchwork Tue Oct 17 06:48:21 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Changwei Ge
X-Patchwork-Id: 10010867
From: Changwei Ge
To: "ocfs2-devel@oss.oracle.com", Mark Fasheh, Junxiao Bi, Joseph Qi,
 Joel Becker
Cc: "linux-fsdevel@vger.kernel.org", Vitaly Mayatskih
Date: Tue, 17 Oct 2017 06:48:21 +0000
Message-ID: <63ADC13FD55D6546B7DECE290D39E373CED6E0F9@H3CMLB14-EX.srv.huawei-3com.com>
Subject: [Ocfs2-devel] [PATCH] ocfs2: fix cluster hang after a node dies

When a node dies, the surviving nodes have to choose a new master for
each existing lock resource that was mastered by the dead node.

In the ocfs2/dlm implementation, this is done by
dlm_move_lockres_to_recovery_list(), which marks those lock resources
as DLM_LOCK_RES_RECOVERING and puts them on a list from which DLM later
changes each lock resource's master.

Without the call to dlm_move_lockres_to_recovery_list(), no new master
is chosen once dlm recovery completes, because no lock resource can be
found through the ::resource list.

Worse, if DLM_LOCK_RES_RECOVERING is not set on lock resources mastered
by a dead node, synchronization among the nodes breaks.
So invoke dlm_move_lockres_to_recovery_list() again.

Fixes: ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery lockres when recovery master goes down")
Reported-by: Vitaly Mayatskih
Signed-off-by: Changwei Ge
Reviewed-by: Jun Piao
Reviewed-by: Joseph Qi
---
 fs/ocfs2/dlm/dlmrecovery.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index 74407c6..ec8f758 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanup(struct dlm_ctxt *dlm, u8 dead_node)
 					dlm_lockres_put(res);
 					continue;
 				}
+				dlm_move_lockres_to_recovery_list(dlm, res);
 			} else if (res->owner == dlm->node_num) {
 				dlm_free_dead_locks(dlm, res, dead_node);
 				__dlm_lockres_calc_usage(dlm, res);
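
Illustration only, not part of the patch: a minimal user-space sketch of
the branch the hunk above restores, to show why the missing call leaves
lock resources without a new master. Every name below (struct lockres,
struct ctxt, RES_RECOVERING, RES_DROPPING_REF, move_to_recovery_list,
cleanup_one, recovery_list_len) is a hypothetical stand-in, and the
sketch leaves out the locking, reference counting and hash-bucket walk
that the kernel code performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel flags; values are arbitrary. */
#define RES_RECOVERING   0x1u	/* models DLM_LOCK_RES_RECOVERING */
#define RES_DROPPING_REF 0x2u	/* models DLM_LOCK_RES_DROPPING_REF */

/* Simplified lock resource: owner, state bits, recovery-list link. */
struct lockres {
	unsigned int owner;		/* node number of the current master */
	unsigned int state;		/* bitmask of RES_* flags */
	struct lockres *next_reco;	/* next entry on the recovery list */
};

/* Simplified domain context: local node and its recovery list. */
struct ctxt {
	unsigned int node_num;		/* this node's number */
	struct lockres *reco_list;	/* resources awaiting a new master */
};

/* Models dlm_move_lockres_to_recovery_list(): flag the resource and queue it. */
static void move_to_recovery_list(struct ctxt *c, struct lockres *res)
{
	assert(c != NULL && res != NULL);
	if (res->state & RES_RECOVERING)
		return;			/* already queued in this sketch */
	res->state |= RES_RECOVERING;
	res->next_reco = c->reco_list;
	c->reco_list = res;
}

/*
 * Models the "res->owner == dead_node" branch of the cleanup loop.
 * Returns true if the resource was queued for remastering.
 */
static bool cleanup_one(struct ctxt *c, struct lockres *res,
			unsigned int dead_node)
{
	if (res->owner != dead_node)
		return false;		/* master is still alive */
	if (res->state & RES_DROPPING_REF)
		return false;		/* being purged, skip recovery */
	move_to_recovery_list(c, res);	/* the call the patch restores */
	return true;
}

/* Counts the resources currently queued for recovery. */
static size_t recovery_list_len(const struct ctxt *c)
{
	size_t n = 0;
	const struct lockres *r;

	for (r = c->reco_list; r != NULL; r = r->next_reco)
		n++;
	return n;
}
```

In this model, dropping the move_to_recovery_list() call from
cleanup_one() leaves reco_list empty and RES_RECOVERING unset for every
resource owned by the dead node, which corresponds to the hang described
in the commit message.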