[RFC] race in exportfs_decode_fh()
diff mbox series

Message ID 20191109031333.GA8566@ZenIV.linux.org.uk
State New
Headers show
Series
  • [RFC] race in exportfs_decode_fh()
Related show

Commit Message

Al Viro Nov. 9, 2019, 3:13 a.m. UTC
On Sat, Nov 02, 2019 at 06:08:42PM +0000, Al Viro wrote:

> It is converging to a reasonably small and understandable surface, actually,
> most of that being in core pathname resolution.  Two big piles of nightmares
> left to review - overlayfs and (somewhat surprisingly) setxattr call chains,
> the latter due to IMA/EVM/LSM insanity...

Oh, lovely - in exportfs_decode_fh() we have this:
                err = exportfs_get_name(mnt, target_dir, nbuf, result);
                if (!err) {
                        inode_lock(target_dir->d_inode);
                        nresult = lookup_one_len(nbuf, target_dir,
                                                 strlen(nbuf));
                        inode_unlock(target_dir->d_inode);
                        if (!IS_ERR(nresult)) {
                                if (nresult->d_inode) {
                                        dput(result);
                                        result = nresult;
                                } else
                                        dput(nresult);
                        }
                }
We have derived the parent from fhandle, we have a disconnected dentry for child,
we go look for the name.  We even find it.  Now, we want to look it up.  And
some bastard goes and unlinks it, just as we are trying to lock the parent.
We do a lookup, and get a negative dentry.  Then we unlock the parent... and
some other bastard does e.g. mkdir with the same name.  OK, nresult->d_inode
is not NULL (anymore).  It has fuck-all to do with the original fhandle
(different inumber, etc.) but we happily accept it.

Even better, we have no barriers between our check and nresult becoming positive.
IOW, having observed non-NULL ->d_inode doesn't give us enough - e.g. we might
still see the old ->d_flags value, from back when ->d_inode used to be NULL.
On something like alpha we also have no promises that we'll observe anything
about the fields of nresult->d_inode, but ->d_flags alone is enough for fun.
The callers can't e.g. expect d_is_reg() et.al. to match the reality.

This is obviously bogus.  And the fix is obvious: check that nresult->d_inode is
equal to result->d_inode before unlocking the parent.  Note that we'd *already* had
the original result and all of its aliases rejected by the 'acceptable' predicate,
so if nresult doesn't supply us a better alias, we are SOL.

Does anyone see objections to the following patch?  Christoph, that seems to
be your code; am I missing something subtle here?  AFAICS, that goes back to
2007 or so...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

Comments

Linus Torvalds Nov. 9, 2019, 4:55 p.m. UTC | #1
On Fri, Nov 8, 2019 at 7:13 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> We have derived the parent from fhandle, we have a disconnected dentry for child,
> we go look for the name.  We even find it.  Now, we want to look it up.  And
> some bastard goes and unlinks it, just as we are trying to lock the parent.
> We do a lookup, and get a negative dentry.  Then we unlock the parent... and
> some other bastard does e.g. mkdir with the same name.  OK, nresult->d_inode
> is not NULL (anymore).  It has fuck-all to do with the original fhandle
> (different inumber, etc.) but we happily accept it.

No arguments with your patch, although I doubt that this case has
actually ever happened in practice ;)

              Linus
Al Viro Nov. 9, 2019, 6:26 p.m. UTC | #2
On Sat, Nov 09, 2019 at 08:55:38AM -0800, Linus Torvalds wrote:
> On Fri, Nov 8, 2019 at 7:13 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > We have derived the parent from fhandle, we have a disconnected dentry for child,
> > we go look for the name.  We even find it.  Now, we want to look it up.  And
> > some bastard goes and unlinks it, just as we are trying to lock the parent.
> > We do a lookup, and get a negative dentry.  Then we unlock the parent... and
> > some other bastard does e.g. mkdir with the same name.  OK, nresult->d_inode
> > is not NULL (anymore).  It has fuck-all to do with the original fhandle
> > (different inumber, etc.) but we happily accept it.
> 
> No arguments with your patch, although I doubt that this case has
> actually ever happened in practice ;)

Frankly, by this point I'm rather tempted to introduce new sparse annotation for
dentries - "might not be positive".  The thing is, there are four cases when
->d_inode is guaranteed to be stable:
	1) nobody else has seen that dentry; that includes in-lookup ones - they
can be found in in-lookup hash by d_alloc_parallel(), but it'll wait until they
cease to be in-lookup.
	2) ->d_lock is held
	3) pinned positive
	4) pinned, parent held at least shared
Anything else can have ->d_inode changed by another thread.  And class 3 is by
far the most common.  As the matter of fact, most of dentry pointers in the
system (as opposed to (h)lists traversing dentries) are such.

For obvious reasons we have shitloads of ->d_inode accesses; I'd been going
through a tree-wide audit of those and doing that manually is bloody unpleasant.
And we do have races in that area - the one above is the latest I'd caught;
there had been more and I'm fairly sure that it's not the last.

Redoing that kind of audit every once in a while is something I wouldn't
wish on anyone; it would be nice to use sparse to narrow the field.  Note
that simply checking ->d_inode (or ->d_flags) is not enough unless we
know them to be stable.  E.g. callers of lookup_one_len_unlocked() tend
to be racy exactly because of such tests.  I'm going to add
lookup_positive_unlocked() that would return ERR_PTR(-ENOENT) on
negatives and convert the callers - with one exception they treat
negatives the same way they would treat ERR_PTR(-ENOENT) and their
tests are racy.  I'd rather have sufficient barriers done in one common
helper, instead of trying to add them in 11 places and hope that new
users won't fuck it up...

I'm still not sure what's the best way to do the annotations - propagation
is not a big deal, but expressing the transitions and checking that
maybe-negative ones are not misused is trickier.  The last part can
be actually left to manual audit - annotations would already reduce
the search area big way.  A not too ugly way to say that now this dentry
is known to be non-negative, OTOH...  I want to finish the audit and
take a good look at the list of places where such transitions happen.
Christoph Hellwig Nov. 11, 2019, 9:16 a.m. UTC | #3
On Sat, Nov 09, 2019 at 03:13:33AM +0000, Al Viro wrote:
> Does anyone see objections to the following patch?  Christoph, that seems to
> be your code; am I missing something subtle here?  AFAICS, that goes back to
> 2007 or so...

This goes back to way before that, that series jut factored out proper
export operations from the two inode or superblock methods we had before
with the rest handled in core code somewhere that made a complete
mess of file systems with 64-bit inode numbers.

Otherwise this looks fine, although splitting the refactoring from
the actual change would make for a much more readable series.

Patch
diff mbox series

diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 09bc68708d28..2dd55b172d57 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -519,26 +519,33 @@  struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
 		 * inode is actually connected to the parent.
 		 */
 		err = exportfs_get_name(mnt, target_dir, nbuf, result);
-		if (!err) {
-			inode_lock(target_dir->d_inode);
-			nresult = lookup_one_len(nbuf, target_dir,
-						 strlen(nbuf));
-			inode_unlock(target_dir->d_inode);
-			if (!IS_ERR(nresult)) {
-				if (nresult->d_inode) {
-					dput(result);
-					result = nresult;
-				} else
-					dput(nresult);
-			}
+		if (err) {
+			dput(target_dir);
+			goto err_result;
 		}
 
+		inode_lock(target_dir->d_inode);
+		nresult = lookup_one_len(nbuf, target_dir, strlen(nbuf));
+		if (!IS_ERR(nresult)) {
+			if (unlikely(nresult->d_inode != result->d_inode)) {
+				dput(nresult);
+				nresult = ERR_PTR(-ESTALE);
+			}
+		}
+		inode_unlock(target_dir->d_inode);
 		/*
 		 * At this point we are done with the parent, but it's pinned
 		 * by the child dentry anyway.
 		 */
 		dput(target_dir);
 
+		if (IS_ERR(nresult)) {
+			err = PTR_ERR(nresult);
+			goto err_result;
+		}
+		dput(result);
+		result = nresult;
+
 		/*
 		 * And finally make sure the dentry is actually acceptable
 		 * to NFSD.