From patchwork Thu Dec 15 09:08:52 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Yan, Zheng" X-Patchwork-Id: 9475693 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 8227C607EE for ; Thu, 15 Dec 2016 09:09:22 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6742A28675 for ; Thu, 15 Dec 2016 09:09:22 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 5A79E28681; Thu, 15 Dec 2016 09:09:22 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id ADDA228675 for ; Thu, 15 Dec 2016 09:09:21 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935738AbcLOJJI convert rfc822-to-8bit (ORCPT ); Thu, 15 Dec 2016 04:09:08 -0500 Received: from mx1.redhat.com ([209.132.183.28]:48802 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S935681AbcLOJJE (ORCPT ); Thu, 15 Dec 2016 04:09:04 -0500 Received: from int-mx11.intmail.prod.int.phx2.redhat.com (int-mx11.intmail.prod.int.phx2.redhat.com [10.5.11.24]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 58D2233E5F2; Thu, 15 Dec 2016 09:08:59 +0000 (UTC) Received: from [10.72.5.234] (vpn1-5-234.pek2.redhat.com [10.72.5.234]) by int-mx11.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id uBF98rnH008642 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Thu, 15 Dec 2016 04:08:56 -0500 Mime-Version: 1.0 (Mac OS X Mail 10.1 \(3251\)) Subject: Re: [PATCH v3 1/5] ceph: clean up unsafe d_parent access in __choose_mds From: "Yan, Zheng" In-Reply-To: <1481727270-12666-2-git-send-email-jlayton@redhat.com> Date: Thu, 15 Dec 2016 17:08:52 +0800 Cc: ceph-devel , Sage Weil , Ilya Dryomov , linux-fsdevel , David Howells , viro@ZenIV.linux.org.uk Message-Id: References: <1481727270-12666-1-git-send-email-jlayton@redhat.com> <1481727270-12666-2-git-send-email-jlayton@redhat.com> To: Jeff Layton X-Scanned-By: MIMEDefang 2.68 on 10.5.11.24 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.27]); Thu, 15 Dec 2016 09:08:59 +0000 (UTC) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP > On 14 Dec 2016, at 22:54, Jeff Layton wrote: > > __choose_mds exists to pick an MDS to use when issuing a call. Doing > that typically involves picking an inode and using the authoritative > MDS for it. In most cases, that's pretty straightforward, as we are > using an inode to which we hold a reference (usually represented by > r_dentry or r_inode in the request). > > In the case of a snapshotted directory however, we need to fetch > the non-snapped parent, which involves walking back up the parents > in the tree. The dentries in the snapshot dir are effectively frozen > but the overall parent is _not_, and could vanish if a concurrent > rename were to occur. > > Clean this code up and take special care to ensure the validity of > the entries we're working with. First, try to use the inode in > r_locked_dir if one exists. If not and all we have is r_dentry, > then we have to walk back up the tree. Use the rcu_read_lock for > this so we can ensure that any d_parent we find won't go away, and > take extra care to deal with the possibility that the dentries could > go negative. > > Change get_nonsnap_parent to return an inode, and take a reference to > that inode before returning (if any). Change all of the other places > where we set "inode" in __choose_mds to also take a reference, and then > call iput on that inode before exiting the function. > > Link: http://tracker.ceph.com/issues/18148 > Signed-off-by: Jeff Layton > --- > fs/ceph/mds_client.c | 67 +++++++++++++++++++++++++++++++++++----------------- > 1 file changed, 45 insertions(+), 22 deletions(-) > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c > index 815acd1a56d4..e62b566d3817 100644 > --- a/fs/ceph/mds_client.c > +++ b/fs/ceph/mds_client.c > @@ -667,6 +667,27 @@ static void __unregister_request(struct ceph_mds_client *mdsc, > } > > /* > + * Walk back up the dentry tree until we hit a dentry representing a > + * non-snapshot inode. We do this using the rcu_read_lock (which must be held > + * when calling this) to ensure that the objects won't disappear while we're > + * working with them. Once we hit a candidate dentry, we attempt to take a > + * reference to it, and return that as the result. > + */ > +static struct inode *get_nonsnap_parent(struct dentry *dentry) { struct inode > + *inode = NULL; > + > + while (dentry && !IS_ROOT(dentry)) { > + inode = d_inode_rcu(dentry); > + if (!inode || ceph_snap(inode) == CEPH_NOSNAP) > + break; > + dentry = dentry->d_parent; > + } > + if (inode) > + inode = igrab(inode); > + return inode; > +} > + > +/* > * Choose mds to send request to next. If there is a hint set in the > * request (e.g., due to a prior forward hint from the mds), use that. > * Otherwise, consult frag tree and/or caps to identify the > @@ -674,19 +695,6 @@ static void __unregister_request(struct ceph_mds_client *mdsc, > * > * Called under mdsc->mutex. > */ > -static struct dentry *get_nonsnap_parent(struct dentry *dentry) > -{ > - /* > - * we don't need to worry about protecting the d_parent access > - * here because we never renaming inside the snapped namespace > - * except to resplice to another snapdir, and either the old or new > - * result is a valid result. > - */ > - while (!IS_ROOT(dentry) && ceph_snap(d_inode(dentry)) != CEPH_NOSNAP) > - dentry = dentry->d_parent; > - return dentry; > -} > - > static int __choose_mds(struct ceph_mds_client *mdsc, > struct ceph_mds_request *req) > { > @@ -716,30 +724,42 @@ static int __choose_mds(struct ceph_mds_client *mdsc, > inode = NULL; > if (req->r_inode) { > inode = req->r_inode; > + ihold(inode); > + } else if (req->r_locked_dir) { > + inode = req->r_locked_dir; > + ihold(inode); this seems incorrect. how about following incremental patch Regards Yan, Zheng > } else if (req->r_dentry) { > /* ignore race with rename; old or new d_parent is okay */ > - struct dentry *parent = req->r_dentry->d_parent; > - struct inode *dir = d_inode(parent); > + struct dentry *parent; > + struct inode *dir; > + > + rcu_read_lock(); > + parent = req->r_dentry->d_parent; > + dir = d_inode_rcu(parent); > > - if (dir->i_sb != mdsc->fsc->sb) { > - /* not this fs! */ > + if (!dir || dir->i_sb != mdsc->fsc->sb) { > + /* not this fs or parent went negative */ > inode = d_inode(req->r_dentry); > + if (inode) > + ihold(inode); > } else if (ceph_snap(dir) != CEPH_NOSNAP) { > /* direct snapped/virtual snapdir requests > * based on parent dir inode */ > - struct dentry *dn = get_nonsnap_parent(parent); > - inode = d_inode(dn); > + inode = get_nonsnap_parent(parent); > dout("__choose_mds using nonsnap parent %p\n", inode); > } else { > /* dentry target */ > inode = d_inode(req->r_dentry); > if (!inode || mode == USE_AUTH_MDS) { > /* dir + name */ > - inode = dir; > + inode = igrab(dir); > hash = ceph_dentry_hash(dir, req->r_dentry); > is_hash = true; > + } else { > + ihold(inode); > } > } > + rcu_read_unlock(); > } > > dout("__choose_mds %p is_hash=%d (%d) mode %d\n", inode, (int)is_hash, > @@ -768,7 +788,7 @@ static int __choose_mds(struct ceph_mds_client *mdsc, > (int)r, frag.ndist); > if (ceph_mdsmap_get_state(mdsc->mdsmap, mds) >= > CEPH_MDS_STATE_ACTIVE) > - return mds; > + goto out; > } > > /* since this file/dir wasn't known to be > @@ -783,7 +803,7 @@ static int __choose_mds(struct ceph_mds_client *mdsc, > inode, ceph_vinop(inode), frag.frag, mds); > if (ceph_mdsmap_get_state(mdsc->mdsmap, mds) >= > CEPH_MDS_STATE_ACTIVE) > - return mds; > + goto out; > } > } > } > @@ -796,6 +816,7 @@ static int __choose_mds(struct ceph_mds_client *mdsc, > cap = rb_entry(rb_first(&ci->i_caps), struct ceph_cap, ci_node); > if (!cap) { > spin_unlock(&ci->i_ceph_lock); > + iput(inode); > goto random; > } > mds = cap->session->s_mds; > @@ -803,6 +824,8 @@ static int __choose_mds(struct ceph_mds_client *mdsc, > inode, ceph_vinop(inode), mds, > cap == ci->i_auth_cap ? "auth " : "", cap); > spin_unlock(&ci->i_ceph_lock); > +out: > + iput(inode); > return mds; > > random: > -- > 2.7.4 > --- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index c834d7d..4abc85f 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -725,9 +725,6 @@ static int __choose_mds(struct ceph_mds_client *mdsc, if (req->r_inode) { inode = req->r_inode; ihold(inode); - } else if (req->r_locked_dir) { - inode = req->r_locked_dir; - ihold(inode); } else if (req->r_dentry) { /* ignore race with rename; old or new d_parent is okay */ struct dentry *parent; @@ -735,7 +732,7 @@ static int __choose_mds(struct ceph_mds_client *mdsc, rcu_read_lock(); parent = req->r_dentry->d_parent; - dir = d_inode_rcu(parent); + dir = req->r_locked_dir ? : d_inode_rcu(parent); if (!dir || dir->i_sb != mdsc->fsc->sb) { /* not this fs or parent went negative */