From patchwork Wed Mar 19 21:03:10 2014
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Andrew Morton <akpm@linux-foundation.org>
X-Patchwork-Id: 3861281
Date: Wed, 19 Mar 2014 14:03:10 -0700
From: Andrew Morton <akpm@linux-foundation.org>
To: Mark Fasheh <mfasheh@suse.de>
Message-Id: <20140319140310.11b4b3992a2e14364c56b107@linux-foundation.org>
In-Reply-To: <20140213204829.GA5716@wotan.suse.de>
References: <20140124204703.AF70F5A4203@corp2gmr1-2.hot.corp.google.com>
	<20140212232921.GZ24361@wotan.suse.de>
	<52FC55B6.7020403@huawei.com> <20140213204829.GA5716@wotan.suse.de>
X-Mailer: Sylpheed 3.2.0beta5 (GTK+ 2.24.10; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Cc: ocfs2-devel@oss.oracle.com
Subject: Re: [Ocfs2-devel] [patch 04/11] ocfs2: fix a tiny race when running
	dirop_fileop_racer

On Thu, 13 Feb 2014 12:48:29 -0800 Mark Fasheh <mfasheh@suse.de> wrote:

> On Thu, Feb 13, 2014 at 01:18:46PM +0800, Joseph Qi wrote:
> > On 2014/2/13 7:29, Mark Fasheh wrote:
> > >> @@ -1097,6 +1174,22 @@ static int ocfs2_rename(struct inode *ol
> > >>  			goto bail;
> > >>  		}
> > >>  		rename_lock = 1;
> > >> +
> > >> +		/* here we cannot guarantee the inodes haven't just been
> > >> +		 * changed, so check if they are nested again */
> > >> +		status = ocfs2_check_if_ancestor(osb, new_dir->i_ino,
> > >> +				old_inode->i_ino);
> > >> +		if (status < 0) {
> > >> +			mlog_errno(status);
> > >> +			goto bail;
> > >> +		} else if (status == 1) {
> > >> +			status = -EPERM;
> > >> +			mlog(ML_ERROR, "src inode %llu should not be ancestor "
> > >> +				"of new dir inode %llu\n",
> > >> +				(unsigned long long)old_inode->i_ino,
> > >> +				(unsigned long long)new_dir->i_ino);
> > > 
> > > Is it possible for the user to trigger this mlog(ML_ERROR, "....") print at
> > > will? If so we need to make it a debug print otherwise we risk blowing up
> > > systemlog when someone abuses rename().
> > > 	--Mark
> > > 
> > > --
> > > Mark Fasheh
> > > 
> > > 
> > The nested condition can be constructed but it is rare, isn't it?
> > And only one system log for one rename, so we log it as error message.
> 
> It's not the rarity of it happening "naturally" that I'm worried about. If
> arguments to rename() can be constructed such that they trigger the print
> then a misbehaving user or program can flood the system log with repeating
> messages. We don't want to leave holes like that exposed - I can speak from
> experience that it results in angry system admins :)

We're still awaiting a resolution here...

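For reference, the concern quoted above (a user-triggerable ML_ERROR print
flooding the system log) is usually handled either by demoting the message
to a debug print, as Mark suggests, or by rate limiting it; the kernel has
printk_ratelimited() for the latter.  The following is only a userspace
sketch of the rate-limiting idea, not proposed patch content: the
log_ratelimit structure, its fields and both helper functions are
hypothetical names made up for this illustration.

```c
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical userspace stand-in for a rate-limit state; this is not an
 * ocfs2 or kernel structure. */
struct log_ratelimit {
	time_t window_start;	/* start of the current window */
	int interval;		/* window length in seconds */
	int burst;		/* messages allowed per window */
	int printed;		/* messages emitted in this window */
	int suppressed;		/* messages dropped in this window */
};

/* Return true if the caller may emit a message at time 'now'. */
static bool log_ratelimit_ok(struct log_ratelimit *rl, time_t now)
{
	if (now - rl->window_start >= rl->interval) {
		/* A new window has started: reset the counters. */
		rl->window_start = now;
		rl->printed = 0;
		rl->suppressed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->suppressed++;
	return false;
}

/* Simulate 'calls' failing rename() attempts at time 'now' and return how
 * many of them actually produced a log line. */
static int simulate_renames(struct log_ratelimit *rl, time_t now, int calls)
{
	int emitted = 0;

	for (int i = 0; i < calls; i++) {
		if (log_ratelimit_ok(rl, now)) {
			fprintf(stderr, "src inode should not be ancestor "
				"of new dir inode\n");
			emitted++;
		}
	}
	return emitted;
}
```

With a state of { .interval = 5, .burst = 10 }, a program that calls
rename() a thousand times in one second gets at most ten lines into the
log for that window, which bounds the damage a misbehaving caller can do.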

From: Yiwen Jiang <jiangyiwen@huawei.com>
Subject: ocfs2: fix a tiny race when running dirop_fileop_racer

When running dirop_fileop_racer we found a deadlock.

Two nodes, say Node A and Node B, mount the same ocfs2 volume.  Create
/race/16/1 in the filesystem, and assume the inode number of dir 16 is
less than the inode number of dir race.

Node A                            Node B
mv /race/16/1 /race/
                                  ls /race/16/
                                  (issued right after Node A has got
                                  the EX lock on /race/16/ and is
                                  trying to get the EX lock on /race)

In this case, Node A holds the EX lock on /race/16/ and wants the EX lock
on /race/.  Node B holds the PR lock on /race/ and wants the PR lock on
/race/16/.  Since EX and PR are mutually exclusive, the two nodes
deadlock.

This patch fixes the problem by locking the ancestor first when the two
directories are nested, and falling back to inode number order otherwise.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
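The ordering rule from the changelog (ancestor first, inode number order
otherwise) can be illustrated with a small userspace sketch.  It is not
the kernel code: the toy_parent table stands in for the on-disk ".."
lookup, and every toy_* name and TOY_* constant is hypothetical.  The
inode numbers are picked so that /race/16 (5) is lower than /race (9), as
in the deadlock scenario; only inode numbers present in the table are
meaningful inputs.

```c
#include <stdint.h>

#define TOY_MAX_INODES	16
#define TOY_ROOT_INO	1
#define TOY_MAX_LOOKUPS	32	/* bound on the walk towards the root */

/* toy_parent[i] is the inode number that ".." of inode i points to. */
static const uint64_t toy_parent[TOY_MAX_INODES] = {
	[1] = 1,	/* root is its own parent */
	[9] = 1,	/* /race */
	[5] = 9,	/* /race/16, lower inode number than /race */
	[7] = 1,	/* /other */
};

/* Return 1 if 'ancestor' is an ancestor of 'ino', 0 otherwise.  The walk
 * follows ".." until it meets 'ancestor', reaches the root, or gives up
 * after TOY_MAX_LOOKUPS steps. */
static int toy_is_ancestor(uint64_t ino, uint64_t ancestor)
{
	uint64_t child = ino;

	for (int i = 0; i < TOY_MAX_LOOKUPS; i++) {
		uint64_t parent = toy_parent[child];

		if (parent == ancestor)
			return 1;
		if (parent == TOY_ROOT_INO)
			return 0;
		child = parent;
	}
	return 0;
}

/* Pick which of two distinct inodes to lock first: the ancestor if the
 * two are nested, otherwise the one with the lower inode number. */
static uint64_t toy_first_to_lock(uint64_t a, uint64_t b)
{
	if (toy_is_ancestor(b, a))
		return a;
	if (toy_is_ancestor(a, b))
		return b;
	return a < b ? a : b;
}
```

For the pair (/race/16, /race) this picks /race first even though its
inode number is higher, which matches the order in which a path lookup on
another node takes its locks; unrelated directories still fall back to
inode number order.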

 fs/ocfs2/namei.c |   97 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 95 insertions(+), 2 deletions(-)

diff -puN fs/ocfs2/namei.c~ocfs2-fix-a-tiny-race-when-running-dirop_fileop_racer fs/ocfs2/namei.c
--- a/fs/ocfs2/namei.c~ocfs2-fix-a-tiny-race-when-running-dirop_fileop_racer
+++ a/fs/ocfs2/namei.c
@@ -995,6 +995,65 @@ leave:
 	return status;
 }
 
+static int ocfs2_check_if_ancestor(struct ocfs2_super *osb,
+		u64 src_inode_no, u64 dest_inode_no)
+{
+	int ret = 0, i = 0;
+	u64 parent_inode_no = 0;
+	u64 child_inode_no = src_inode_no;
+	struct inode *child_inode;
+
+#define MAX_LOOKUP_TIMES 32
+	while (1) {
+		child_inode = ocfs2_iget(osb, child_inode_no, 0, 0);
+		if (IS_ERR(child_inode)) {
+			ret = PTR_ERR(child_inode);
+			break;
+		}
+
+		ret = ocfs2_inode_lock(child_inode, NULL, 0);
+		if (ret < 0) {
+			iput(child_inode);
+			if (ret != -ENOENT)
+				mlog_errno(ret);
+			break;
+		}
+
+		ret = ocfs2_lookup_ino_from_name(child_inode, "..", 2,
+				&parent_inode_no);
+		ocfs2_inode_unlock(child_inode, 0);
+		iput(child_inode);
+		if (ret < 0) {
+			ret = -ENOENT;
+			break;
+		}
+
+		if (parent_inode_no == dest_inode_no) {
+			ret = 1;
+			break;
+		}
+
+		if (parent_inode_no == osb->root_inode->i_ino) {
+			ret = 0;
+			break;
+		}
+
+		child_inode_no = parent_inode_no;
+
+		if (++i >= MAX_LOOKUP_TIMES) {
+			mlog(ML_NOTICE, "max lookup times reached, filesystem "
+					"may have nested directories, "
+					"src inode: %llu, dest inode: %llu.\n",
+					(unsigned long long)src_inode_no,
+					(unsigned long long)dest_inode_no);
+			ret = 0;
+			break;
+		}
+	}
+
+	return ret;
+}
+
 /*
  * The only place this should be used is rename!
  * if they have the same id, then the 1st one is the only one locked.
@@ -1006,6 +1065,7 @@ static int ocfs2_double_lock(struct ocfs
 			     struct inode *inode2)
 {
 	int status;
+	int inode1_is_ancestor, inode2_is_ancestor;
 	struct ocfs2_inode_info *oi1 = OCFS2_I(inode1);
 	struct ocfs2_inode_info *oi2 = OCFS2_I(inode2);
 	struct buffer_head **tmpbh;
@@ -1019,9 +1079,26 @@ static int ocfs2_double_lock(struct ocfs
 	if (*bh2)
 		*bh2 = NULL;
 
-	/* we always want to lock the one with the lower lockid first. */
+	/* If the inodes are nested, lock the ancestor first; otherwise
+	 * lock the one with the lower lockid first. */
 	if (oi1->ip_blkno != oi2->ip_blkno) {
-		if (oi1->ip_blkno < oi2->ip_blkno) {
+		inode1_is_ancestor = ocfs2_check_if_ancestor(osb, oi2->ip_blkno,
+				oi1->ip_blkno);
+		if (inode1_is_ancestor < 0) {
+			status = inode1_is_ancestor;
+			goto bail;
+		}
+
+		inode2_is_ancestor = ocfs2_check_if_ancestor(osb, oi1->ip_blkno,
+				oi2->ip_blkno);
+		if (inode2_is_ancestor < 0) {
+			status = inode2_is_ancestor;
+			goto bail;
+		}
+
+		if ((inode1_is_ancestor == 1) ||
+				(oi1->ip_blkno < oi2->ip_blkno &&
+				inode2_is_ancestor == 0)) {
 			/* switch id1 and id2 around */
 			tmpbh = bh2;
 			bh2 = bh1;
@@ -1138,6 +1215,22 @@ static int ocfs2_rename(struct inode *ol
 			goto bail;
 		}
 		rename_lock = 1;
+
+		/* here we cannot guarantee the inodes haven't just been
+		 * changed, so check if they are nested again */
+		status = ocfs2_check_if_ancestor(osb, new_dir->i_ino,
+				old_inode->i_ino);
+		if (status < 0) {
+			mlog_errno(status);
+			goto bail;
+		} else if (status == 1) {
+			status = -EPERM;
+			mlog(ML_ERROR, "src inode %llu should not be ancestor "
+				"of new dir inode %llu\n",
+				(unsigned long long)old_inode->i_ino,
+				(unsigned long long)new_dir->i_ino);
+			goto bail;
+		}
 	}
 
 	/* if old and new are the same, this'll just do one lock. */