
[RFC] NFSD: fix cannot umounting mount points under pseudo root

Message ID 20150504214822.GA16827@fieldses.org (mailing list archive)
State New, archived

Commit Message

J. Bruce Fields May 4, 2015, 9:48 p.m. UTC
On Sun, May 03, 2015 at 09:16:53AM +1000, NeilBrown wrote:
> On Fri, 1 May 2015 09:29:53 -0400 "J. Bruce Fields" <bfields@fieldses.org>
> wrote:
> 
> > On Fri, May 01, 2015 at 01:08:26PM +1000, NeilBrown wrote:
> > > On Fri, 1 May 2015 03:29:40 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote:
> > > 
> > > > On Fri, May 01, 2015 at 12:23:33PM +1000, NeilBrown wrote:
> > > > > > What kind of consistency warranties do callers expect, BTW?  You do realize
> > > > > > that between iterate_dir() and callbacks an entry might have been removed
> > > > > > and/or replaced?
> > > > > 
> > > > > For READDIR_PLUS, lookup_one_len is called on each name and it requires
> > > > > i_mutex, so the code currently holds i_mutex over the whole sequence.
> > > > > This is triggering a deadlock.
> > > > 
> > > > Yes, I've seen the context.  However, you are _not_ holding it between
> > > > actual iterate_dir() and those callbacks, which opens a window when
> > > > directory might have been changed.
> > > > 
> > > > Again, what kind of consistency is expected by callers?  Are they ready to
> > > > cope with "there's no such entry anymore" or "inumber is nothing like
> > > > what we'd put in ->ino, since it's not the same object" or "->d_type is
> > > > completely unrelated to what we'd found, since the damn thing had been
> > > > removed and created from scratch"?
> > > 
> > > Ah, sorry.
> > > 
> > > Yes, the callers are prepared for "there's no such entry anymore".
> > > They don't use d_type, so don't care if it might be meaningless.
> > > NFSv4 doesn't use ino either, but NFSv3 does and isn't properly cautious
> > > about ino changing.
> > > 
> > > In nfs3xdr, we should probably pass 'ino' to encode_entryplus_baggage() and
> > > thence to compose_entry_fh() and it should report failure if
> > > dchild->d_inode->i_ino doesn't match.
> > 
> > Just to make sure I understand the concern..... So it shouldn't really
> > be a problem if readdir and lookup find different objects for the same
> > name, the problem is just when we mix attributes from the two objects,
> > right?  Looks like the v3 code could return an inode number derived from
> > the readdir and a filehandle from the lookup, which is a problem.  The
> > v4 code will get everything from the result of the lookup, which should
> > be OK.
> 
> That agrees with my understanding, yes.
> 
> I did wonder for a little while about the possibility of a directory
> containing both 'a' and 'b', and NFSv4 doing the readdir and the stat of 'a',
> and then a "mv a b" happening before the stat of 'b'.
> 
> Then the readdir response will show both 'a' and 'b' referring to the same
> object with a link count of 1.
> 
> I can't quite decide if that is a problem or not.
> 
> 
> > 
> > > Simply not returning the extra attributes is perfectly acceptable in NFSv3.
> > 
> > Right, so no big deal anyway.
> > --b.
> 
> Not a big deal, but we should really add a patch like the following ("like"
> as in "actually compile tested and documented" which this one isn't).

Doesn't seem to break anything.  Any second thoughts, or can I add a
signed-off-by?

--b.

commit e11f8acace69
Author: NeilBrown <neilb@suse.de>
Date:   Sun May 3 09:16:53 2015 +1000

    nfsd: stop READDIRPLUS returning inconsistent attributes
    
    The NFSv3 READDIRPLUS gets some of the returned attributes from the
    readdir, and some from an inode returned from a new lookup.  The two
    objects could be different thanks to intervening renames.
    
    The attributes in READDIRPLUS are optional, so let's just skip them if
    we notice this case.
    
    Signed-off-by: J. Bruce Fields <bfields@redhat.com>


Comments

NeilBrown May 5, 2015, 10:27 p.m. UTC | #1
On Mon, 4 May 2015 17:48:22 -0400 "J. Bruce Fields" <bfields@fieldses.org>
wrote:

> On Sun, May 03, 2015 at 09:16:53AM +1000, NeilBrown wrote:
> > On Fri, 1 May 2015 09:29:53 -0400 "J. Bruce Fields" <bfields@fieldses.org>
> > wrote:
> > 
> > > On Fri, May 01, 2015 at 01:08:26PM +1000, NeilBrown wrote:
> > > > On Fri, 1 May 2015 03:29:40 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote:
> > > > 
> > > > > On Fri, May 01, 2015 at 12:23:33PM +1000, NeilBrown wrote:
> > > > > > > What kind of consistency warranties do callers expect, BTW?  You do realize
> > > > > > > that between iterate_dir() and callbacks an entry might have been removed
> > > > > > > and/or replaced?
> > > > > > 
> > > > > > For READDIR_PLUS, lookup_one_len is called on each name and it requires
> > > > > > i_mutex, so the code currently holds i_mutex over the whole sequence.
> > > > > > This is triggering a deadlock.
> > > > > 
> > > > > Yes, I've seen the context.  However, you are _not_ holding it between
> > > > > actual iterate_dir() and those callbacks, which opens a window when
> > > > > directory might have been changed.
> > > > > 
> > > > > Again, what kind of consistency is expected by callers?  Are they ready to
> > > > > cope with "there's no such entry anymore" or "inumber is nothing like
> > > > > what we'd put in ->ino, since it's not the same object" or "->d_type is
> > > > > completely unrelated to what we'd found, since the damn thing had been
> > > > > removed and created from scratch"?
> > > > 
> > > > Ah, sorry.
> > > > 
> > > > Yes, the callers are prepared for "there's no such entry anymore".
> > > > They don't use d_type, so don't care if it might be meaningless.
> > > > NFSv4 doesn't use ino either, but NFSv3 does and isn't properly cautious
> > > > about ino changing.
> > > > 
> > > > In nfs3xdr, we should probably pass 'ino' to encode_entryplus_baggage() and
> > > > thence to compose_entry_fh() and it should report failure if
> > > > dchild->d_inode->i_ino doesn't match.
> > > 
> > > Just to make sure I understand the concern..... So it shouldn't really
> > > be a problem if readdir and lookup find different objects for the same
> > > name, the problem is just when we mix attributes from the two objects,
> > > right?  Looks like the v3 code could return an inode number derived from
> > > the readdir and a filehandle from the lookup, which is a problem.  The
> > > v4 code will get everything from the result of the lookup, which should
> > > be OK.
> > 
> > That agrees with my understanding, yes.
> > 
> > I did wonder for a little while about the possibility of a directory
> > containing both 'a' and 'b', and NFSv4 doing the readdir and the stat of 'a',
> > and then a "mv a b" happening before the stat of 'b'.
> > 
> > Then the readdir response will show both 'a' and 'b' referring to the same
> > object with a link count of 1.
> > 
> > I can't quite decide if that is a problem or not.
> > 
> > 
> > > 
> > > > Simply not returning the extra attributes is perfectly acceptable in NFSv3.
> > > 
> > > Right, so no big deal anyway.
> > > --b.
> > 
> > Not a big deal, but we should really add a patch like the following ("like"
> > as in "actually compile tested and documented" which this one isn't).
> 
> Doesn't seem to break anything.  Any second thoughts, or can I add a
> signed-off-by?

No second thoughts.

Signed-off-by: NeilBrown <neilb@suse.de>

Thanks.
NeilBrown

> 
> --b.
> 
> commit e11f8acace69
> Author: NeilBrown <neilb@suse.de>
> Date:   Sun May 3 09:16:53 2015 +1000
> 
>     nfsd: stop READDIRPLUS returning inconsistent attributes
>     
>     The NFSv3 READDIRPLUS gets some of the returned attributes from the
>     readdir, and some from an inode returned from a new lookup.  The two
>     objects could be different thanks to intervening renames.
>     
>     The attributes in READDIRPLUS are optional, so let's just skip them if
>     we notice this case.
>     
>     Signed-off-by: J. Bruce Fields <bfields@redhat.com>
> 
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index e4b2b4322553..f6e7cbabac5a 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -805,7 +805,7 @@ encode_entry_baggage(struct nfsd3_readdirres *cd, __be32 *p, const char *name,
>  
>  static __be32
>  compose_entry_fh(struct nfsd3_readdirres *cd, struct svc_fh *fhp,
> -		const char *name, int namlen)
> +		 const char *name, int namlen, u64 ino)
>  {
>  	struct svc_export	*exp;
>  	struct dentry		*dparent, *dchild;
> @@ -830,19 +830,21 @@ compose_entry_fh(struct nfsd3_readdirres *cd, struct svc_fh *fhp,
>  		goto out;
>  	if (d_really_is_negative(dchild))
>  		goto out;
> +	if (dchild->d_inode->i_ino != ino)
> +		goto out;
>  	rv = fh_compose(fhp, exp, dchild, &cd->fh);
>  out:
>  	dput(dchild);
>  	return rv;
>  }
>  
> -static __be32 *encode_entryplus_baggage(struct nfsd3_readdirres *cd, __be32 *p, const char *name, int namlen)
> +static __be32 *encode_entryplus_baggage(struct nfsd3_readdirres *cd, __be32 *p, const char *name, int namlen, u64 ino)
>  {
>  	struct svc_fh	*fh = &cd->scratch;
>  	__be32 err;
>  
>  	fh_init(fh, NFS3_FHSIZE);
> -	err = compose_entry_fh(cd, fh, name, namlen);
> +	err = compose_entry_fh(cd, fh, name, namlen, ino);
>  	if (err) {
>  		*p++ = 0;
>  		*p++ = 0;
> @@ -927,7 +929,7 @@ encode_entry(struct readdir_cd *ccd, const char *name, int namlen,
>  		p = encode_entry_baggage(cd, p, name, namlen, ino);
>  
>  		if (plus)
> -			p = encode_entryplus_baggage(cd, p, name, namlen);
> +			p = encode_entryplus_baggage(cd, p, name, namlen, ino);
>  		num_entry_words = p - cd->buffer;
>  	} else if (*(page+1) != NULL) {
>  		/* temporarily encode entry into next page, then move back to
> @@ -941,7 +943,7 @@ encode_entry(struct readdir_cd *ccd, const char *name, int namlen,
>  		p1 = encode_entry_baggage(cd, p1, name, namlen, ino);
>  
>  		if (plus)
> -			p1 = encode_entryplus_baggage(cd, p1, name, namlen);
> +			p1 = encode_entryplus_baggage(cd, p1, name, namlen, ino);
>  
>  		/* determine entry word length and lengths to go in pages */
>  		num_entry_words = p1 - tmp;

Patch

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index e4b2b4322553..f6e7cbabac5a 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -805,7 +805,7 @@  encode_entry_baggage(struct nfsd3_readdirres *cd, __be32 *p, const char *name,
 
 static __be32
 compose_entry_fh(struct nfsd3_readdirres *cd, struct svc_fh *fhp,
-		const char *name, int namlen)
+		 const char *name, int namlen, u64 ino)
 {
 	struct svc_export	*exp;
 	struct dentry		*dparent, *dchild;
@@ -830,19 +830,21 @@  compose_entry_fh(struct nfsd3_readdirres *cd, struct svc_fh *fhp,
 		goto out;
 	if (d_really_is_negative(dchild))
 		goto out;
+	if (dchild->d_inode->i_ino != ino)
+		goto out;
 	rv = fh_compose(fhp, exp, dchild, &cd->fh);
 out:
 	dput(dchild);
 	return rv;
 }
 
-static __be32 *encode_entryplus_baggage(struct nfsd3_readdirres *cd, __be32 *p, const char *name, int namlen)
+static __be32 *encode_entryplus_baggage(struct nfsd3_readdirres *cd, __be32 *p, const char *name, int namlen, u64 ino)
 {
 	struct svc_fh	*fh = &cd->scratch;
 	__be32 err;
 
 	fh_init(fh, NFS3_FHSIZE);
-	err = compose_entry_fh(cd, fh, name, namlen);
+	err = compose_entry_fh(cd, fh, name, namlen, ino);
 	if (err) {
 		*p++ = 0;
 		*p++ = 0;
@@ -927,7 +929,7 @@  encode_entry(struct readdir_cd *ccd, const char *name, int namlen,
 		p = encode_entry_baggage(cd, p, name, namlen, ino);
 
 		if (plus)
-			p = encode_entryplus_baggage(cd, p, name, namlen);
+			p = encode_entryplus_baggage(cd, p, name, namlen, ino);
 		num_entry_words = p - cd->buffer;
 	} else if (*(page+1) != NULL) {
 		/* temporarily encode entry into next page, then move back to
@@ -941,7 +943,7 @@  encode_entry(struct readdir_cd *ccd, const char *name, int namlen,
 		p1 = encode_entry_baggage(cd, p1, name, namlen, ino);
 
 		if (plus)
-			p1 = encode_entryplus_baggage(cd, p1, name, namlen);
+			p1 = encode_entryplus_baggage(cd, p1, name, namlen, ino);
 
 		/* determine entry word length and lengths to go in pages */
 		num_entry_words = p1 - tmp;
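
A note on what "skipping" looks like on the wire: when compose_entry_fh() fails, encode_entryplus_baggage() writes two zero words. As I read RFC 1813, those are the discriminants of the optional name_attributes and name_handle fields of an entryplus3, both set to FALSE. The following standalone sketch shows that encoding; the function name encode_absent_baggage() and the use of uint32_t with htonl() in place of the kernel's __be32 are assumptions made for illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode "no attributes, no file handle" for one
 * READDIRPLUS entry.  Each optional field is an XDR union switched on
 * a boolean, so an absent field is a single 32-bit word holding FALSE.
 *
 * Returns the position just past the two words written.
 */
static uint32_t *encode_absent_baggage(uint32_t *p)
{
	assert(p != NULL);

	*p++ = htonl(0);	/* name_attributes: attributes_follow = FALSE */
	*p++ = htonl(0);	/* name_handle: handle_follows = FALSE */
	return p;
}
```

The client still gets the name, fileid and cookie for the entry, and can fall back to an explicit LOOKUP or GETATTR if it needs the rest.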