[RFC] NFSD: fix cannot umounting mount points under pseudo root

Message ID 20150503091653.35169382@notabene.brown (mailing list archive)
State New, archived

Commit Message

NeilBrown May 2, 2015, 11:16 p.m. UTC
On Fri, 1 May 2015 09:29:53 -0400 "J. Bruce Fields" <bfields@fieldses.org>
wrote:

> On Fri, May 01, 2015 at 01:08:26PM +1000, NeilBrown wrote:
> > On Fri, 1 May 2015 03:29:40 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote:
> > 
> > > On Fri, May 01, 2015 at 12:23:33PM +1000, NeilBrown wrote:
> > > > > What kind of consistency warranties do callers expect, BTW?  You do realize
> > > > > that between iterate_dir() and callbacks an entry might have been removed
> > > > > and/or replaced?
> > > > 
> > > > For READDIR_PLUS, lookup_one_len is called on each name and it requires
> > > > i_mutex, so the code currently holds i_mutex over the whole sequence.
> > > > This is triggering a deadlock.
> > > 
> > > Yes, I've seen the context.  However, you are _not_ holding it between
> > > actual iterate_dir() and those callbacks, which opens a window when
> > > directory might have been changed.
> > > 
> > > Again, what kind of consistency is expected by callers?  Are they ready to
> > > cope with "there's no such entry anymore" or "inumber is nothing like
> > > what we'd put in ->ino, since it's not the same object" or "->d_type is
> > > completely unrelated to what we'd found, since the damn thing had been
> > > removed and created from scratch"?
> > 
> > Ah, sorry.
> > 
> > Yes, the callers are prepared for "there's no such entry anymore".
> > They don't use d_type, so don't care if it might be meaningless.
> > NFSv4 doesn't use ino either, but NFSv3 does and isn't properly cautious
> > about ino changing.
> > 
> > In nfs3xdr, we should probably pass 'ino' to encode_entryplus_baggage() and
> > thence to compose_entry_fh() and it should report failure if
> > dchild->d_inode->i_ino doesn't match.
> 
> Just to make sure I understand the concern..... So it shouldn't really
> be a problem if readdir and lookup find different objects for the same
> name, the problem is just when we mix attributes from the two objects,
> right?  Looks like the v3 code could return an inode number derived from
> the readdir and a filehandle from the lookup, which is a problem.  The
> v4 code will get everything from the result of the lookup, which should
> be OK.

That agrees with my understanding, yes.

I did wonder for a little while about the possibility of a directory
containing both 'a' and 'b', and NFSv4 doing the readdir and the stat of 'a',
and then a "mv a b" happening before the stat of 'b'.

Then the readdir response will show both 'a' and 'b' referring to the same
object with a link count of 1.

I can't quite decide if that is a problem or not.


> 
> > Simply not returning the extra attributes is perfectly acceptable in NFSv3.
> 
> Right, so no big deal anyway.
> 
> --b.

Not a big deal, but we should really add a patch like the following ("like"
as in "actually compile tested and documented" which this one isn't).

NeilBrown

> 
> > So it looks like we are mostly OK here - we don't really need i_mutex to be
> > held for very long.
> > 
> > NeilBrown
> > 
>

Comments

J. Bruce Fields May 3, 2015, 12:37 a.m. UTC | #1
On Sun, May 03, 2015 at 09:16:53AM +1000, NeilBrown wrote:
> On Fri, 1 May 2015 09:29:53 -0400 "J. Bruce Fields" <bfields@fieldses.org>
> wrote:
> 
> > On Fri, May 01, 2015 at 01:08:26PM +1000, NeilBrown wrote:
> > > On Fri, 1 May 2015 03:29:40 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote:
> > > 
> > > > On Fri, May 01, 2015 at 12:23:33PM +1000, NeilBrown wrote:
> > > > > > What kind of consistency warranties do callers expect, BTW?  You do realize
> > > > > > that between iterate_dir() and callbacks an entry might have been removed
> > > > > > and/or replaced?
> > > > > 
> > > > > For READDIR_PLUS, lookup_one_len is called on each name and it requires
> > > > > i_mutex, so the code currently holds i_mutex over the whole sequence.
> > > > > This is triggering a deadlock.
> > > > 
> > > > Yes, I've seen the context.  However, you are _not_ holding it between
> > > > actual iterate_dir() and those callbacks, which opens a window when
> > > > directory might have been changed.
> > > > 
> > > > Again, what kind of consistency is expected by callers?  Are they ready to
> > > > cope with "there's no such entry anymore" or "inumber is nothing like
> > > > what we'd put in ->ino, since it's not the same object" or "->d_type is
> > > > completely unrelated to what we'd found, since the damn thing had been
> > > > removed and created from scratch"?
> > > 
> > > Ah, sorry.
> > > 
> > > Yes, the callers are prepared for "there's no such entry anymore".
> > > They don't use d_type, so don't care if it might be meaningless.
> > > NFSv4 doesn't use ino either, but NFSv3 does and isn't properly cautious
> > > about ino changing.
> > > 
> > > In nfs3xdr, we should probably pass 'ino' to encode_entryplus_baggage() and
> > > thence to compose_entry_fh() and it should report failure if
> > > dchild->d_inode->i_ino doesn't match.
> > 
> > Just to make sure I understand the concern..... So it shouldn't really
> > be a problem if readdir and lookup find different objects for the same
> > name, the problem is just when we mix attributes from the two objects,
> > right?  Looks like the v3 code could return an inode number derived from
> > the readdir and a filehandle from the lookup, which is a problem.  The
> > v4 code will get everything from the result of the lookup, which should
> > be OK.
> 
> That agrees with my understanding, yes.
> 
> I did wonder for a little while about the possibility of a directory
> containing both 'a' and 'b', and NFSv4 doing the readdir and the stat of 'a',
> and then a "mv a b" happening before the stat of 'b'.
> 
> Then the readdir response will show both 'a' and 'b' referring to the same
> object with a link count of 1.
> 
> I can't quite decide if that is a problem or not.

My understanding is that that's completely normal behavior for lots of
filesystems.

Googling around....  Here's Ted on the question:

	http://yarchive.net/comp/linux/readdir_nonatomicity.html

	In some cases it won't even just get lost, but the old and new
	name can both be returned.  For example, if you assume the use
	of a simple non-tree, linked-list implementation of a directory,
	such as can be found in Solaris's ufs, BSD 4.3's FFS, Linux's ext2
	and minix filesystems, and many others, and you have a fully
	tightly packed directory (i.e., no gaps), with the directory
	entry "foo" at the beginning of the file, and readdir() has
	already returned the first "foo" entry when some other
	application renames it "Supercalifragilisticexpialidocious", the
	new name will not fit in the old name's directory location, so
	it will be placed at the end of the directory --- where it will
	be returned by readdir() a second time.

	This is not a bug; the POSIX specification explicitly allows
	this behavior.  If a filename is renamed during a readdir()
	session of a directory, it is undefined whether neither,
	either, or both of the new and old filenames will be returned.

--b.
NeilBrown May 4, 2015, 4:11 a.m. UTC | #2
On Sat, 2 May 2015 20:37:43 -0400 "J. Bruce Fields" <bfields@fieldses.org>
wrote:

> On Sun, May 03, 2015 at 09:16:53AM +1000, NeilBrown wrote:
> > On Fri, 1 May 2015 09:29:53 -0400 "J. Bruce Fields" <bfields@fieldses.org>
> > wrote:
> > 
> > > On Fri, May 01, 2015 at 01:08:26PM +1000, NeilBrown wrote:
> > > > On Fri, 1 May 2015 03:29:40 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote:
> > > > 
> > > > > On Fri, May 01, 2015 at 12:23:33PM +1000, NeilBrown wrote:
> > > > > > > What kind of consistency warranties do callers expect, BTW?  You do realize
> > > > > > > that between iterate_dir() and callbacks an entry might have been removed
> > > > > > > and/or replaced?
> > > > > > 
> > > > > > For READDIR_PLUS, lookup_one_len is called on each name and it requires
> > > > > > i_mutex, so the code currently holds i_mutex over the whole sequence.
> > > > > > This is triggering a deadlock.
> > > > > 
> > > > > Yes, I've seen the context.  However, you are _not_ holding it between
> > > > > actual iterate_dir() and those callbacks, which opens a window when
> > > > > directory might have been changed.
> > > > > 
> > > > > Again, what kind of consistency is expected by callers?  Are they ready to
> > > > > cope with "there's no such entry anymore" or "inumber is nothing like
> > > > > what we'd put in ->ino, since it's not the same object" or "->d_type is
> > > > > completely unrelated to what we'd found, since the damn thing had been
> > > > > removed and created from scratch"?
> > > > 
> > > > Ah, sorry.
> > > > 
> > > > Yes, the callers are prepared for "there's no such entry anymore".
> > > > They don't use d_type, so don't care if it might be meaningless.
> > > > NFSv4 doesn't use ino either, but NFSv3 does and isn't properly cautious
> > > > about ino changing.
> > > > 
> > > > In nfs3xdr, we should probably pass 'ino' to encode_entryplus_baggage() and
> > > > thence to compose_entry_fh() and it should report failure if
> > > > dchild->d_inode->i_ino doesn't match.
> > > 
> > > Just to make sure I understand the concern..... So it shouldn't really
> > > be a problem if readdir and lookup find different objects for the same
> > > name, the problem is just when we mix attributes from the two objects,
> > > right?  Looks like the v3 code could return an inode number derived from
> > > the readdir and a filehandle from the lookup, which is a problem.  The
> > > v4 code will get everything from the result of the lookup, which should
> > > be OK.
> > 
> > That agrees with my understanding, yes.
> > 
> > I did wonder for a little while about the possibility of a directory
> > containing both 'a' and 'b', and NFSv4 doing the readdir and the stat of 'a',
> > and then a "mv a b" happening before the stat of 'b'.
> > 
> > Then the readdir response will show both 'a' and 'b' referring to the same
> > object with a link count of 1.
> > 
> > I can't quite decide if that is a problem or not.
> 
> My understanding is that that's completely normal behavior for lots of
> filesystems.
> 
> Googling around....  Here's Ted on the question:
> 
> 	http://yarchive.net/comp/linux/readdir_nonatomicity.html
> 
> 	In some cases it won't even just get lost, but the old and new
> 	name can both be returned.  For example, if you assume the use
> 	of a simple non-tree, linked-list implementation of a directory,
> 	such as can be found in Solaris's ufs, BSD 4.3's FFS, Linux's ext2
> 	and minix filesystems, and many others, and you have a fully
> 	tightly packed directory (i.e., no gaps), with the directory
> 	entry "foo" at the beginning of the file, and readdir() has
> 	already returned the first "foo" entry when some other
> 	application renames it "Supercalifragilisticexpialidocious", the
> 	new name will not fit in the old name's directory location, so
> 	it will be placed at the end of the directory --- where it will
> 	be returned by readdir() a second time.
> 
> 	This is not a bug; the POSIX specification explicitly allows
> 	this behavior.  If a filename is renamed during a readdir()
> 	session of a directory, it is undefined whether neither,
> 	either, or both of the new and old filenames will be returned.
> 

I think that is a slightly different situation to the one I was imagining.
Ted's observation here is completely about readdir results.

An NFS READDIR_PLUS result can be used to satisfy subsequent stat() requests.
I don't think it would ever be correct to 
  stat('a')
  stat('b')
  stat('a')

and get exactly the same stat info in every case, including inode number and
ctime and link count of '1'.
If those stats were served from the READDIRPLUS results that I described
above, that is exactly what you would get.
I'm not sure if the post-op attributes would be enough to tell the client it
needs to do a GETATTR again straight away to verify things.

But if the attr info stays cached (which is kind-of the point of
READDIR_PLUS), this is a very different circumstance than the one Ted
described.

Still not sure how important it is, but I like the NFSv3 option of just not
returning attributes if we aren't certain of them.

An option for NFSv4 might be to abort/retry the readdir op if the directory
has changed at all(?).
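
A rough userspace sketch of that abort/retry idea, assuming the directory's
mtime and ctime stand in for the NFSv4 change attribute (the function name
stable_entry_count() and the retry limit are made up for illustration):

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/* Compare two timestamps for exact equality. */
static bool same_time(struct timespec a, struct timespec b)
{
	return a.tv_sec == b.tv_sec && a.tv_nsec == b.tv_nsec;
}

/*
 * Hypothetical: count the entries in `path`, accepting a pass only if the
 * directory's mtime and ctime were the same before and after the scan.
 * Returns the entry count, or -1 on error or if no stable pass was seen
 * within max_tries attempts.  A change that lands inside one timestamp
 * tick goes unnoticed, so this is a heuristic and not a guarantee.
 */
int stable_entry_count(const char *path, int max_tries)
{
	for (int attempt = 0; attempt < max_tries; attempt++) {
		struct stat before, after;
		struct dirent *de;
		int n = 0;
		DIR *d = opendir(path);

		if (!d)
			return -1;
		if (fstat(dirfd(d), &before) != 0) {
			closedir(d);
			return -1;
		}
		while ((de = readdir(d)) != NULL)
			n++;
		if (fstat(dirfd(d), &after) != 0) {
			closedir(d);
			return -1;
		}
		closedir(d);

		if (same_time(before.st_mtim, after.st_mtim) &&
		    same_time(before.st_ctim, after.st_ctim))
			return n;	/* nothing changed during the scan */
		/* directory changed under us: throw the pass away and retry */
	}
	return -1;
}
```

The obvious cost is that a busy directory may never produce a stable pass,
so a real implementation would need some fallback once the retry limit is
reached.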

NeilBrown

Patch

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index e4b2b4322553..f6e7cbabac5a 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -805,7 +805,7 @@  encode_entry_baggage(struct nfsd3_readdirres *cd, __be32 *p, const char *name,
 
 static __be32
 compose_entry_fh(struct nfsd3_readdirres *cd, struct svc_fh *fhp,
-		const char *name, int namlen)
+		 const char *name, int namlen, u64 ino)
 {
 	struct svc_export	*exp;
 	struct dentry		*dparent, *dchild;
@@ -830,19 +830,21 @@  compose_entry_fh(struct nfsd3_readdirres *cd, struct svc_fh *fhp,
 		goto out;
 	if (d_really_is_negative(dchild))
 		goto out;
+	if (dchild->d_inode->i_ino != ino)
+		goto out;
 	rv = fh_compose(fhp, exp, dchild, &cd->fh);
 out:
 	dput(dchild);
 	return rv;
 }
 
-static __be32 *encode_entryplus_baggage(struct nfsd3_readdirres *cd, __be32 *p, const char *name, int namlen)
+static __be32 *encode_entryplus_baggage(struct nfsd3_readdirres *cd, __be32 *p, const char *name, int namlen, u64 ino)
 {
 	struct svc_fh	*fh = &cd->scratch;
 	__be32 err;
 
 	fh_init(fh, NFS3_FHSIZE);
-	err = compose_entry_fh(cd, fh, name, namlen);
+	err = compose_entry_fh(cd, fh, name, namlen, ino);
 	if (err) {
 		*p++ = 0;
 		*p++ = 0;
@@ -927,7 +929,7 @@  encode_entry(struct readdir_cd *ccd, const char *name, int namlen,
 		p = encode_entry_baggage(cd, p, name, namlen, ino);
 
 		if (plus)
-			p = encode_entryplus_baggage(cd, p, name, namlen);
+			p = encode_entryplus_baggage(cd, p, name, namlen, ino);
 		num_entry_words = p - cd->buffer;
 	} else if (*(page+1) != NULL) {
 		/* temporarily encode entry into next page, then move back to
@@ -941,7 +943,7 @@  encode_entry(struct readdir_cd *ccd, const char *name, int namlen,
 		p1 = encode_entry_baggage(cd, p1, name, namlen, ino);
 
 		if (plus)
-			p1 = encode_entryplus_baggage(cd, p1, name, namlen);
+			p1 = encode_entryplus_baggage(cd, p1, name, namlen, ino);
 
 		/* determine entry word length and lengths to go in pages */
 		num_entry_words = p1 - tmp;