From patchwork Wed Mar 19 21:03:10 2014
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 3861281
Date: Wed, 19 Mar 2014 14:03:10 -0700
From: Andrew Morton
To: Mark Fasheh
Cc: ocfs2-devel@oss.oracle.com
Subject: Re: [Ocfs2-devel] [patch 04/11] ocfs2: fix a tiny race when running dirop_fileop_racer
Message-Id: <20140319140310.11b4b3992a2e14364c56b107@linux-foundation.org>
In-Reply-To: <20140213204829.GA5716@wotan.suse.de>
References: <20140124204703.AF70F5A4203@corp2gmr1-2.hot.corp.google.com>
 <20140212232921.GZ24361@wotan.suse.de>
 <52FC55B6.7020403@huawei.com>
 <20140213204829.GA5716@wotan.suse.de>

On Thu, 13 Feb 2014 12:48:29 -0800 Mark Fasheh wrote:

> On Thu, Feb 13, 2014 at 01:18:46PM +0800, Joseph Qi wrote:
> > On 2014/2/13 7:29, Mark Fasheh wrote:
> > >> @@ -1097,6 +1174,22 @@ static int ocfs2_rename(struct inode *ol
> > >>  			goto bail;
> > >>  		}
> > >>  		rename_lock = 1;
> > >> +
> > >> +		/* here we cannot guarantee the inodes haven't just been
> > >> +		 * changed, so check if they are nested again */
> > >> +		status = ocfs2_check_if_ancestor(osb, new_dir->i_ino,
> > >> +				old_inode->i_ino);
> > >> +		if (status < 0) {
> > >> +			mlog_errno(status);
> > >> +			goto bail;
> > >> +		} else if (status == 1) {
> > >> +			status = -EPERM;
> > >> +			mlog(ML_ERROR, "src inode %llu should not be ancestor "
> > >> +				"of new dir inode %llu\n",
> > >> +				(unsigned long long)old_inode->i_ino,
> > >> +				(unsigned long long)new_dir->i_ino);
> > >
> > > Is it possible for the user to trigger this mlog(ML_ERROR, "....") print at
> > > will? If so we need to make it a debug print otherwise we risk blowing up
> > > systemlog when someone abuses rename().
> > > 	--Mark
> > >
> > > --
> > > Mark Fasheh
> > >
> > >
> >
> > The nested condition can be constructed but it is rare, isn't it?
> > And only one system log for one rename, so we log it as error message.
>
> It's not the rarity of it happening "naturally" that I'm worried about. If
> arguments to rename() can be constructed such that they trigger the print
> then a misbehaving user or program can flood the system log with repeating
> messages.
> We don't want to leave holes like that exposed - I can speak from
> experience that it results in angry system admins :)

We're still awaiting a resolution here...



From: Yiwen Jiang
Subject: ocfs2: fix a tiny race when running dirop_fileop_racer

When running dirop_fileop_racer we found a deadlock case.

2 nodes, say Node A and Node B, mount the same ocfs2 volume.  Create
/race/16/1 in the filesystem, such that the inode number of dir 16 is
less than the inode number of dir race.

Node A                            Node B
mv /race/16/1 /race/
                                  right after Node A has got the
                                  EX mode of /race/16/, and tries to
                                  get EX mode of /race
                                  ls /race/16/

In this case, Node A has got the EX mode of /race/16/, and wants to get EX
mode of /race/.  Node B has got the PR mode of /race/, and wants to get
the PR mode of /race/16/.  Since EX and PR are mutually exclusive, a
deadlock happens.

This patch fixes this case by locking in ancestor order before trying
inode number order.

Signed-off-by: Yiwen Jiang
Signed-off-by: Joseph Qi
Cc: Joel Becker
Cc: Mark Fasheh
Signed-off-by: Andrew Morton
---

 fs/ocfs2/namei.c |   97 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 95 insertions(+), 2 deletions(-)

diff -puN fs/ocfs2/namei.c~ocfs2-fix-a-tiny-race-when-running-dirop_fileop_racer fs/ocfs2/namei.c
--- a/fs/ocfs2/namei.c~ocfs2-fix-a-tiny-race-when-running-dirop_fileop_racer
+++ a/fs/ocfs2/namei.c
@@ -995,6 +995,65 @@ leave:
 	return status;
 }
 
+static int ocfs2_check_if_ancestor(struct ocfs2_super *osb,
+		u64 src_inode_no, u64 dest_inode_no)
+{
+	int ret = 0, i = 0;
+	u64 parent_inode_no = 0;
+	u64 child_inode_no = src_inode_no;
+	struct inode *child_inode;
+
+#define MAX_LOOKUP_TIMES 32
+	while (1) {
+		child_inode = ocfs2_iget(osb, child_inode_no, 0, 0);
+		if (IS_ERR(child_inode)) {
+			ret = PTR_ERR(child_inode);
+			break;
+		}
+
+		ret = ocfs2_inode_lock(child_inode, NULL, 0);
+		if (ret < 0) {
+			iput(child_inode);
+			if (ret != -ENOENT)
+				mlog_errno(ret);
+			break;
+		}
+
+		ret = ocfs2_lookup_ino_from_name(child_inode, "..", 2,
+				&parent_inode_no);
+		ocfs2_inode_unlock(child_inode, 0);
+		iput(child_inode);
+		if (ret < 0) {
+			ret = -ENOENT;
+			break;
+		}
+
+		if (parent_inode_no == dest_inode_no) {
+			ret = 1;
+			break;
+		}
+
+		if (parent_inode_no == osb->root_inode->i_ino) {
+			ret = 0;
+			break;
+		}
+
+		child_inode_no = parent_inode_no;
+
+		if (++i >= MAX_LOOKUP_TIMES) {
+			mlog(ML_NOTICE, "max lookup times reached, filesystem "
+					"may have nested directories, "
+					"src inode: %llu, dest inode: %llu.\n",
+					(unsigned long long)src_inode_no,
+					(unsigned long long)dest_inode_no);
+			ret = 0;
+			break;
+		}
+	}
+
+	return ret;
+}
+
 /*
  * The only place this should be used is rename!
  * if they have the same id, then the 1st one is the only one locked.
@@ -1006,6 +1065,7 @@ static int ocfs2_double_lock(struct ocfs
 			     struct inode *inode2)
 {
 	int status;
+	int inode1_is_ancestor, inode2_is_ancestor;
 	struct ocfs2_inode_info *oi1 = OCFS2_I(inode1);
 	struct ocfs2_inode_info *oi2 = OCFS2_I(inode2);
 	struct buffer_head **tmpbh;
@@ -1019,9 +1079,26 @@ static int ocfs2_double_lock(struct ocfs
 	if (*bh2)
 		*bh2 = NULL;
 
-	/* we always want to lock the one with the lower lockid first. */
+	/* we always want to lock the one with the lower lockid first.
+	 * and if they are nested, we lock ancestor first */
 	if (oi1->ip_blkno != oi2->ip_blkno) {
-		if (oi1->ip_blkno < oi2->ip_blkno) {
+		inode1_is_ancestor = ocfs2_check_if_ancestor(osb, oi2->ip_blkno,
+				oi1->ip_blkno);
+		if (inode1_is_ancestor < 0) {
+			status = inode1_is_ancestor;
+			goto bail;
+		}
+
+		inode2_is_ancestor = ocfs2_check_if_ancestor(osb, oi1->ip_blkno,
+				oi2->ip_blkno);
+		if (inode2_is_ancestor < 0) {
+			status = inode2_is_ancestor;
+			goto bail;
+		}
+
+		if ((inode1_is_ancestor == 1) ||
+				(oi1->ip_blkno < oi2->ip_blkno &&
+				inode2_is_ancestor == 0)) {
 			/* switch id1 and id2 around */
 			tmpbh = bh2;
 			bh2 = bh1;
@@ -1138,6 +1215,22 @@ static int ocfs2_rename(struct inode *ol
 			goto bail;
 		}
 		rename_lock = 1;
+
+		/* here we cannot guarantee the inodes haven't just been
+		 * changed, so check if they are nested again */
+		status = ocfs2_check_if_ancestor(osb, new_dir->i_ino,
+				old_inode->i_ino);
+		if (status < 0) {
+			mlog_errno(status);
+			goto bail;
+		} else if (status == 1) {
+			status = -EPERM;
+			mlog(ML_ERROR, "src inode %llu should not be ancestor "
+				"of new dir inode %llu\n",
+				(unsigned long long)old_inode->i_ino,
+				(unsigned long long)new_dir->i_ino);
+			goto bail;
+		}
 	}
 
 	/* if old and new are the same, this'll just do one lock. */
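
For anyone following the thread, the ordering rule the changelog describes
(ancestor first when the two directories are nested, otherwise lower inode
number first) can be sketched in plain userspace C.  This is only an
illustration under stated assumptions, not the ocfs2 code: struct sketch_tree,
the sketch_* names and the inode numbers 1, 20 and 30 are all hypothetical,
and the parent walk reads an in-memory array instead of taking cluster locks
and looking up ".." on disk as ocfs2_check_if_ancestor() does.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_LOOKUPS 32	/* same bound as MAX_LOOKUP_TIMES in the patch */
#define SKETCH_NR_INODES   32	/* hypothetical size of the toy tree */

/* Hypothetical toy tree: parent_of[ino] is the inode number of "..". */
struct sketch_tree {
	uint64_t root;
	uint64_t parent_of[SKETCH_NR_INODES];
};

/* Return 1 if dest is an ancestor of src, 0 otherwise. */
static int sketch_is_ancestor(const struct sketch_tree *t,
			      uint64_t src, uint64_t dest)
{
	uint64_t child = src;
	int i;

	for (i = 0; i < SKETCH_MAX_LOOKUPS; i++) {
		uint64_t parent;

		assert(child < SKETCH_NR_INODES);
		parent = t->parent_of[child];
		if (parent == dest)
			return 1;
		if (parent == t->root)
			return 0;
		child = parent;
	}
	return 0;	/* gave up: treat as not nested, as the patch does */
}

/* Return the inode number that should be locked first. */
static uint64_t sketch_first_to_lock(const struct sketch_tree *t,
				     uint64_t ino1, uint64_t ino2)
{
	if (sketch_is_ancestor(t, ino2, ino1))
		return ino1;	/* ino1 is an ancestor of ino2 */
	if (sketch_is_ancestor(t, ino1, ino2))
		return ino2;	/* ino2 is an ancestor of ino1 */
	return ino1 < ino2 ? ino1 : ino2;	/* not nested: lower number first */
}

/* The changelog case: dir "16" has a lower inode number than dir "race". */
static uint64_t sketch_changelog_case(void)
{
	struct sketch_tree t = { .root = 1 };

	t.parent_of[1] = 1;	/* the root is its own parent */
	t.parent_of[30] = 1;	/* /race */
	t.parent_of[20] = 30;	/* /race/16 */
	return sketch_first_to_lock(&t, 20, 30);
}
```

With these made-up numbers sketch_changelog_case() picks 30 (/race) even
though 20 is the lower number, so both nodes would take /race before
/race/16/ and the EX/PR ordering conflict from the changelog does not arise.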