From patchwork Wed Feb 19 10:10:40 2014
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wengang Wang <wen.gang.wang@oracle.com>
X-Patchwork-Id: 3680301
Return-Path: <ocfs2-devel-bounces@oss.oracle.com>
X-Original-To: patchwork-ocfs2-devel@patchwork.kernel.org
Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org
Received: from mail.kernel.org (mail.kernel.org [198.145.19.201])
	by patchwork2.web.kernel.org (Postfix) with ESMTP id B729DBF13A
	for <patchwork-ocfs2-devel@patchwork.kernel.org>;
	Wed, 19 Feb 2014 11:11:56 +0000 (UTC)
Received: from mail.kernel.org (localhost [127.0.0.1])
	by mail.kernel.org (Postfix) with ESMTP id BAAC120165
	for <patchwork-ocfs2-devel@patchwork.kernel.org>;
	Wed, 19 Feb 2014 11:11:55 +0000 (UTC)
Received: from userp1040.oracle.com (userp1040.oracle.com [156.151.31.81])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by mail.kernel.org (Postfix) with ESMTPS id 6F6A820131
	for <patchwork-ocfs2-devel@patchwork.kernel.org>;
	Wed, 19 Feb 2014 11:11:54 +0000 (UTC)
Received: from acsinet21.oracle.com (acsinet21.oracle.com [141.146.126.237])
	by userp1040.oracle.com (Sentrion-MTA-4.3.2/Sentrion-MTA-4.3.2)
	with ESMTP id s1JAAtEt026729
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);
	Wed, 19 Feb 2014 10:10:57 GMT
Received: from oss.oracle.com (oss-external.oracle.com [137.254.96.51])
	by acsinet21.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id
	s1JAAr9F005720
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Wed, 19 Feb 2014 10:10:53 GMT
Received: from localhost ([127.0.0.1] helo=oss.oracle.com)
	by oss.oracle.com with esmtp (Exim 4.63)
	(envelope-from <ocfs2-devel-bounces@oss.oracle.com>)
	id 1WG47B-0008LU-I2; Wed, 19 Feb 2014 02:10:53 -0800
Received: from ucsinet21.oracle.com ([156.151.31.93])
	by oss.oracle.com with esmtp (Exim 4.63)
	(envelope-from <wen.gang.wang@oracle.com>) id 1WG473-0008LH-Gz
	for ocfs2-devel@oss.oracle.com; Wed, 19 Feb 2014 02:10:45 -0800
Received: from aserz7021.oracle.com (aserz7021.oracle.com [141.146.126.230])
	by ucsinet21.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id
	s1JAAiVi009822
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Wed, 19 Feb 2014 10:10:44 GMT
Received: from abhmp0010.oracle.com (abhmp0010.oracle.com [141.146.116.16])
	by aserz7021.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id
	s1JAAih7005378; Wed, 19 Feb 2014 10:10:44 GMT
Received: from dhcp-cn-10-182-69-130.cn.oracle.com (/10.182.64.159)
	by default (Oracle Beehive Gateway v4.0)
	with ESMTP ; Wed, 19 Feb 2014 02:10:42 -0800
Message-ID: <53048320.6050506@oracle.com>
Date: Wed, 19 Feb 2014 18:10:40 +0800
From: Wengang <wen.gang.wang@oracle.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
	rv:24.0) Gecko/20100101 Thunderbird/24.2.0
MIME-Version: 1.0
To: Junxiao Bi <junxiao.bi@oracle.com>, ocfs2-devel@oss.oracle.com
References: <1392798631-16510-1-git-send-email-junxiao.bi@oracle.com>
In-Reply-To: <1392798631-16510-1-git-send-email-junxiao.bi@oracle.com>
X-MIME-Autoconverted: from 8bit to quoted-printable by ucsinet21.oracle.com
	id s1JAAiVi009822
Cc: mfasheh@suse.de, joe.jin@oracle.com
Subject: Re: [Ocfs2-devel] [PATCH] ocfs2: dlm: fix recovery hung
X-BeenThere: ocfs2-devel@oss.oracle.com
X-Mailman-Version: 2.1.9
Precedence: list
List-Id: <ocfs2-devel.oss.oracle.com>
List-Unsubscribe: <https://oss.oracle.com/mailman/listinfo/ocfs2-devel>,
	<mailto:ocfs2-devel-request@oss.oracle.com?subject=unsubscribe>
List-Archive: <http://oss.oracle.com/pipermail/ocfs2-devel>
List-Post: <mailto:ocfs2-devel@oss.oracle.com>
List-Help: <mailto:ocfs2-devel-request@oss.oracle.com?subject=help>
List-Subscribe: <https://oss.oracle.com/mailman/listinfo/ocfs2-devel>,
	<mailto:ocfs2-devel-request@oss.oracle.com?subject=subscribe>
Sender: ocfs2-devel-bounces@oss.oracle.com
Errors-To: ocfs2-devel-bounces@oss.oracle.com
X-Source-IP: acsinet21.oracle.com [141.146.126.237]
X-Spam-Status: No, score=-4.8 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_MED,
	RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

Seems DLM_RECO_STATE_FINALIZE is used to track the situation that the recovery master(node 0) dead (just) after it had finished the dlm recovery. So that the original dead node(node 1) needed not be recovered again. This flag is set only on non recovery master nodes.
In this case, things looks like the recovery master reset "recover_master" and "dead node" after another node(node 2) had started to recover another dead node(node 3). This would happen only on the "old" recovery master(node 0). Note this flag DLM_RECO_STATE_ACTIVE, it's set only on master when it begins to recover and cleared when recovery finished. For an "old" master, checking DLM_RECO_STATE_ACTIVE makes sense. I'd like to suggest:


thanks
wengang

? 2014?02?19? 16:30, Junxiao Bi ??:
> There is a race window in dlm_do_recovery() between dlm_remaster_locks()
> and dlm_reset_recovery() when the recovery master nearly finish the recovery
> process for a dead node. After the master sends FINALIZE_RECO message in
> dlm_remaster_locks(), another node may become the recovery master for another
> dead node, and then send the BEGIN_RECO message to all the nodes included the
> old master, in the handler of this message dlm_begin_reco_handler() of old master,
> dlm->reco.dead_node and dlm->reco.new_master will be set to the second dead
> node and the new master, then in dlm_reset_recovery(), these two variables
> will be reset to default value. This will cause new recovery master can not finish
> the recovery process and hung, at last the whole cluster will hung for recovery.
>
> old recovery master:                                 new recovery master:
> dlm_remaster_locks()
>                                                    become recovery master for
>                                                    another dead node.
>                                                    dlm_send_begin_reco_message()
> dlm_begin_reco_handler()
> {
>   if (dlm->reco.state & DLM_RECO_STATE_FINALIZE) {
>    return -EAGAIN;
>   }
>   dlm_set_reco_master(dlm, br->node_idx);
>   dlm_set_reco_dead_node(dlm, br->dead_node);
> }
> dlm_reset_recovery()
> {
>   dlm_set_reco_dead_node(dlm, O2NM_INVALID_NODE_NUM);
>   dlm_set_reco_master(dlm, O2NM_INVALID_NODE_NUM);
> }
>                                                    will hung in dlm_remaster_locks() for
>                                                    request dlm locks info
>
> Before send FINALIZE_RECO message, recovery master should set DLM_RECO_STATE_FINALIZE
> for itself and clear it after the recovery done, this can break the race windows as
> the BEGIN_RECO messages will not be handled before DLM_RECO_STATE_FINALIZE flag is
> cleared.
>
> A similar race may happen between new recovery master and normal node which is in
> dlm_finalize_reco_handler(), also fix it.
>
> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
> ---
>   fs/ocfs2/dlm/dlmrecovery.c |   11 +++++++++--
>   1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 7035af0..fe58e8b 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -537,7 +537,10 @@ master_here:
>   		/* success!  see if any other nodes need recovery */
>   		mlog(0, "DONE mastering recovery of %s:%u here(this=%u)!\n",
>   		     dlm->name, dlm->reco.dead_node, dlm->node_num);
> -		dlm_reset_recovery(dlm);
> +		spin_lock(&dlm->spinlock);
> +		__dlm_reset_recovery(dlm);
> +		dlm->reco.state &= ~DLM_RECO_STATE_FINALIZE;
> +		spin_unlock(&dlm->spinlock);
>   	}
>   	dlm_end_recovery(dlm);
>   
> @@ -695,6 +698,10 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node)
>   		if (all_nodes_done) {
>   			int ret;
>   
> +			spin_lock(&dlm->spinlock);
> +			dlm->reco.state |= DLM_RECO_STATE_FINALIZE;
> +			spin_unlock(&dlm->spinlock);
> +
>   			/* all nodes are now in DLM_RECO_NODE_DATA_DONE state
>   	 		 * just send a finalize message to everyone and
>   	 		 * clean up */
> @@ -2882,8 +2889,8 @@ int dlm_finalize_reco_handler(struct o2net_msg *msg, u32 len, void *data,
>   				BUG();
>   			}
>   			dlm->reco.state &= ~DLM_RECO_STATE_FINALIZE;
> +			__dlm_reset_recovery(dlm);
>   			spin_unlock(&dlm->spinlock);
> -			dlm_reset_recovery(dlm);
>   			dlm_kick_recovery_thread(dlm);
>   			break;
>   		default:

--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -2704,7 +2704,7 @@ int dlm_begin_reco_handler(struct o2net_msg *msg, u32 len, void *data,
                 return 0;
  
         spin_lock(&dlm->spinlock);
-       if (dlm->reco.state & DLM_RECO_STATE_FINALIZE) {
+       if (dlm->reco.state & (DLM_RECO_STATE_FINALIZE | DLM_RECO_STATE_ACTIVE)) {
                 mlog(0, "%s: node %u wants to recover node %u (%u:%u) "
                      "but this node is in finalize state, waiting on finalize2\n",
                      dlm->name, br->node_idx, br->dead_node,